What is Jupyter, and why do you care? After all, Jupyter has never become a buzzword like data science, artificial intelligence, or Web 2.0. Unlike those big abstractions, Jupyter is very concrete. It’s an open source project, a piece of software, that does specific things.
But without attracting the hype, Jupyter Notebooks are revolutionizing the way engineers and data scientists work together. If all important work is collaborative, the most important tools we have are tools for collaboration, tools that make working together more productive.
That's what Jupyter is, in a nutshell: it's a tool for collaborating. It’s built for writing and sharing code and text, within the context of a web page. The code runs on a server, and the results are turned into HTML and incorporated into the page you're writing. That server can be anywhere: on your laptop, behind your firewall, or on the public internet. Your page contains your thoughts, your code, and the results of running the code.
Code is never just code. It's part of a thought process, an argument, even an experiment. This is particularly true for data analysis, but it's true for almost any application. Jupyter lets you build a "lab notebook" that shows your work: the code, the data, the results, along with your explanation and reasoning. As IBM puts it, Jupyter lets you build a "computational narrative that distills data into insights." Data means nothing if you can't turn it into insight, if you can't explore it, share it, and discuss it. Data analysis means little if you can't explore and experiment with someone else's results. Jupyter is a tool for exploring, sharing, and discussing.
A notebook is easily shareable. You can save it and send it as an attachment, so someone else can open the notebook with Jupyter. You can put the notebook in a GitHub repository and let others read it there; GitHub automatically renders the notebook to a static web page. GitHub users can download (clone) their own copy of the notebook and any supporting files so they can expand on your work: they can inspect the results, modify the code, and see what happens. It's a lot easier to maintain an up-to-date archive on GitHub than to distribute your code, data, supporting files, and results by hand. You can go further by using container technology, such as Docker, to package your notebook, a notebook server, any libraries you need, your data, and a stripped-down operating system, into a single downloadable object.
Sharing can be as public as you want. You can run a Jupyter server on your laptop, largely inaccessible to anyone else. You can run a multi-user Jupyter server, JupyterHub, behind your corporate firewall. You can even push Jupyter Notebooks into the cloud. GitHub and GitLab (a host-it-yourself git server) automatically convert notebooks into static HTML for access over the web, and platforms like Binder allow others to run your code in the cloud. They can experiment with it and modify it, all within the context of a private instance.
I've said that sharing becomes even easier when Jupyter is combined with Docker. One of the biggest problems facing developers in any programming language is installing the software and libraries you need to run someone else's code. Version incompatibilities and operating system incompatibilities make your life painful; it can literally take days just to install the software needed to run a complex project. That pain can be eliminated by combining Jupyter with Docker. Docker lets you build a container that includes everything needed to run your notebook. So, when you share the container, which is as simple as sharing a link, you're not just sharing your project: you're sharing all the dependencies needed to run that project, in a form that's known to work.
When you combine Jupyter with containers and a source management system like GitHub, you get a platform for collaboration: on coding, on data analysis, on visualization, on anything that can be done in most programming languages.
The Jupyter architecture
While it's not important to understand Jupyter's internals, it is important to understand what it lets you build. It's not just a tool: it's a platform, an ecosystem, that enables others to build tools on top of it.
Jupyter is built from three parts:
- The notebook front end, which lets you edit and run notebooks. It is a JavaScript application that runs in your browser.
- The Jupyter server, which is either a relatively simple application that runs on your laptop, or a multi-user server. The Jupyter project’s JupyterHub is the most widely used multi-user server for Jupyter.
- The kernel protocol, which allows the server to offload the task of running code to a language-specific kernel. Jupyter ships with kernels for Python 2 and Python 3, but kernels for many other languages are available.
This architecture, though simple, is very flexible. You can substitute your own front end, as nteract has done: its main responsibility is managing documents. You can build a front end that implements real-time dashboards; you can use the Jupyter protocol to implement support for other languages; you can implement custom servers to create new media types. O’Reilly Media’s Orioles combine Jupyter Notebooks with a parallel streaming video narrative, synchronized to the notebook.
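To make the kernel protocol a little more concrete, here is a minimal sketch of the kind of message a front end sends when you run a cell. The field names follow the Jupyter messaging specification (an "execute_request" message); the helper name, the default username, and the sample code string are assumptions made up for illustration, and the sketch only builds the message rather than sending it to a kernel.

```python
import json
import uuid
from datetime import datetime, timezone


def make_execute_request(code, session_id, username="reader"):
    """Build an execute_request message as a plain dictionary.

    The keys mirror the Jupyter messaging specification; the helper itself
    and the default username are hypothetical, for illustration only.
    """
    header = {
        "msg_id": uuid.uuid4().hex,          # unique ID for this message
        "session": session_id,               # ID shared by one client session
        "username": username,
        "date": datetime.now(timezone.utc).isoformat(),
        "msg_type": "execute_request",
        "version": "5.3",                    # messaging protocol version
    }
    content = {
        "code": code,                        # source text of the cell to run
        "silent": False,
        "store_history": True,
        "user_expressions": {},
        "allow_stdin": False,
        "stop_on_error": True,
    }
    return {
        "header": header,
        "parent_header": {},                 # empty: this message starts a conversation
        "metadata": {},
        "content": content,
    }


message = make_execute_request("print(1 + 1)", session_id=uuid.uuid4().hex)
print(json.dumps(message, indent=2))
```

Nothing in the message says which language the code is written in. That is what lets the same front end and server work with a Python kernel, an R kernel, or any other kernel that understands the protocol.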
The Jupyter workflow
Git is a version control system: it's used for tracking different versions of software, and recording the difference between versions. It allows you to roll back to an earlier version; it also allows code sharing, and it allows multiple people to work on the same codebase and resolve conflicts between their changes. Jupyter Notebooks are just text files containing JSON: to Git they look like any other source file, and Git has no problem working with them.
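As a rough illustration of why Git can track notebooks, the sketch below builds a tiny notebook by hand using only Python's standard library. The top-level keys and cell fields follow the published notebook format (nbformat 4); the cell contents and cell IDs are invented for this example.

```python
import json

# A minimal notebook: one Markdown cell and one code cell.
# The cell text and the "id" values are made up for illustration.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "id": "intro",
            "metadata": {},
            "source": ["# Ride analysis\n", "Notes on the data go here."],
        },
        {
            "cell_type": "code",
            "id": "first-sum",
            "execution_count": None,   # None means the cell has not been run
            "metadata": {},
            "outputs": [],             # results are stored here after a run
            "source": ["total = 1 + 1\n", "print(total)"],
        },
    ],
}

# Serialized, the notebook is ordinary line-oriented text.
text = json.dumps(notebook, indent=1)
print(text)
```

Because the saved file is line-oriented text, Git can store it, diff it, and merge it like any other file in the repository.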
Docker is a tool for automating application deployment. It allows you to "shrink wrap" everything that's needed for an application to run: Jupyter itself, the notebook, all the libraries, and any other tools (data, etc.) needed to run the application, even a stripped-down operating system (typically Linux). One of the most painful parts of sharing code in any widely used programming language is resolving conflicts between libraries, the programming language, the operating system, etc. If you've tried to install someone else's software, you've probably experienced version hell: their project requires database X, but X needs library Y version 1.9, and you have version 1.8 installed. When you try to get 1.9, you find that it won't build. To build version 1.9 of library Y, you need version 3.4 of library Z, but Z has never been ported to your operating system. And so on. Docker eliminates the problem: instead of delivering your code by itself, you deliver the entire package, everything that's needed in the runtime environment. You start the container, the container brings up its own stripped-down operating system environment, which starts Jupyter, and everything just works.
I won't describe how to use Git and Docker in detail; they're tools that could be simpler, and a number of organizations (including O'Reilly) are working on tools to simplify integrating Jupyter with Git and Docker. With Git and Docker in the picture, the workflow looks like this:
- Use Git locally (or use an external service, like GitHub): whenever you reach a significant point in your work, commit the results to the Git repository. You'll then be able to return to this version later should you need to.
- Keep a Dockerfile in your repository, along with your notebooks. Use the Dockerfile to record everything you need to run the notebooks: libraries, data, utilities. There are pre-built Docker images that contain most of what you need, for a number of common environments, so, in practice, you don't have to modify the Dockerfile much.
- Run the Jupyter server inside the Docker container. That keeps everything clean and isolated.
- You can push your Docker image to a registry, such as DockerHub. At that point, other users can pull your image, build a container that will match yours, and run your code without worrying about version hell.
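The steps above can be sketched as a short list of commands. The snippet below only prints them; it does not run Git or Docker. The notebook filename, the image name "example-user/ride-analysis", and the commit message are hypothetical. The base image "jupyter/scipy-notebook" is one of the images published by the Jupyter Docker Stacks project, but the registry and tag you should use may differ, so treat the Dockerfile as an assumption to check against that project's documentation.

```python
import shlex

# Hypothetical image name; substitute your own registry account and project.
IMAGE = "example-user/ride-analysis:latest"

# A minimal Dockerfile: start from a pre-built Jupyter image and add the notebook.
# /home/jovyan/work is the working directory used by the Jupyter Docker Stacks images.
DOCKERFILE = """\
FROM jupyter/scipy-notebook
COPY analysis.ipynb /home/jovyan/work/
"""


def workflow_commands(image=IMAGE, message="Reach a checkpoint"):
    """Return the workflow as argument lists, in the order you would run them."""
    return [
        ["git", "add", "analysis.ipynb", "Dockerfile"],         # stage notebook and Dockerfile
        ["git", "commit", "-m", message],                       # record a version you can return to
        ["docker", "build", "-t", image, "."],                  # build an image from the Dockerfile
        ["docker", "run", "--rm", "-p", "8888:8888", image],    # run Jupyter inside the container
        ["docker", "push", image],                              # publish the image to a registry
    ]


print(DOCKERFILE)
for command in workflow_commands():
    print(shlex.join(command))
```

Keeping the commands as argument lists means they could later be handed to subprocess.run without any shell quoting problems; here they are only printed so the sketch has no side effects.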
The Jupyter workflow requires some discipline, but it's worth it. The Jupyter project maintains a collection of Dockerfiles for many common configurations: Python with numeric libraries for data analysis, Python and Scala with Apache Spark (a general purpose engine for distributed computation with data), Python with R, and many others. Using these containers eliminates most installation pain: all you need to do is install Docker, then download and start a container with a single command.
Jupyter at work and school
IBM recently published a case study describing work they did for the Executive Transportation Group, a black car service operating in New York City. Data analysts used Jupyter to analyze data about rides and drivers, using Apache Spark to distribute the computation across many computers. They used Spark’s distributed computing capabilities via the Toree kernel; Toree and Spark allowed them to process tens of millions of geographic lookups in a timely way.
To create the ETG project, IBM contributed to, and took advantage of, several extensions to Jupyter—a give-and-take relationship that is only possible with an open source project. The team used Jupyter interactive (“declarative”) widgets to build dashboards that allowed them to communicate results with staff from ETG. Interactive widgets let developers provide the kinds of controls you'd expect for graphical applications: sliders, buttons, and other web components. The dashboard extension makes it possible to build with complex layouts, rather than the linear top-to-bottom layout that notebooks give you by default.
The IBM and ETG teams were able to iterate rapidly as they refined their analytic tools: they could deploy the dashboard as a web application, collect feedback and questions from their users, modify the application, and iterate. Jupyter enabled an agile process for analyzing the data and building the tools that ETG needed.
Lorena Barba, professor of mechanical engineering at George Washington University, is a leader in using Jupyter in teaching. She calls Jupyter Notebooks “computable content,” and calls them “a killer app for STEM education,” because notebooks make it possible to share material directly with students. It’s not just written on the blackboard; it’s shared in a way that allows students to interact directly, and it can be combined with text, links, images, videos. You don’t learn to code through lectures; you learn by interacting and experimenting.
The new Foundations of Data Science course at UC Berkeley, required of all undergraduates, demonstrates this approach at scale. Thousands of students receive assignments, access to large data sets, and instructions for completing the assignments as notebooks. All the code runs on JupyterHub in the cloud, eliminating the problem of software installation. According to the instructors, the course “wouldn’t be possible without Jupyter Notebooks, which enable browser-based computation, avoiding the need for students to install software, transfer files, or update libraries.” Extensions to Jupyter, which will be incorporated into future releases, support real-time interaction between students and teachers: questions, answers, help on assignments, all in the context of the actual code the student is writing.
We’ve just talked about the widget and dashboard extensions. There are also widgets for more advanced tasks, like creating maps based on OpenStreetMap, and doing interactive data visualization in 2D and 3D. There's also an extension that "bridges" Jupyter and d3.js, one of the most widely used JavaScript libraries for building data-driven web documents.
The Jupyter ecosystem also includes tools for publishing your documents in different ways. For example, nbviewer is a simple tool that allows non-programmers to view Jupyter Notebooks. It doesn’t run the code or allow modifications; it just renders the “finished product” as a web page. Nbviewer can be installed locally; there is also a public nbviewer service, which can render any notebook that’s available online. All you need is the URL.
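As a small sketch of "all you need is the URL," the helper below builds the address the public nbviewer service uses for a notebook stored on GitHub. The URL pattern is the one the public service uses for GitHub-hosted notebooks at the time of writing; the function name and the owner, repository, and path in the example are hypothetical.

```python
from urllib.parse import quote


def nbviewer_github_url(owner, repo, ref, path):
    """Return the public nbviewer address for a notebook in a GitHub repository.

    owner/repo identify the repository, ref is a branch or tag name,
    and path is the notebook's location inside the repository.
    """
    return "https://nbviewer.org/github/{}/{}/blob/{}/{}".format(
        quote(owner),
        quote(repo),
        quote(ref, safe="/"),    # keep slashes in branch names
        quote(path, safe="/"),   # escape spaces and other special characters
    )


# Hypothetical repository and notebook path.
url = nbviewer_github_url(
    "example-user", "ride-analysis", "main", "notebooks/first look.ipynb"
)
print(url)
```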
Nbviewer is based on nbconvert, which converts notebooks into many different static formats, including HTML, LaTeX, PDF, scripts (just the code, as an executable script, without the rest of the notebook), and slides.
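In practice you would run nbconvert itself (for example, "jupyter nbconvert --to script" followed by the notebook's filename). To show the idea behind the script format, here is a minimal standard-library sketch that keeps only the code cells of a notebook. It is not nbconvert's implementation, and the function name and sample notebook are invented for illustration.

```python
import json


def notebook_to_script(notebook_json):
    """Return the code cells of a notebook (given as JSON text) as one script."""
    notebook = json.loads(notebook_json)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue                      # skip Markdown and raw cells
        source = cell.get("source", "")
        if isinstance(source, list):      # the format allows a list of lines
            source = "".join(source)
        if source.strip():                # ignore empty code cells
            chunks.append(source.rstrip("\n"))
    # Separate cells with a blank line and end the script with a newline.
    return "\n\n".join(chunks) + "\n"


# A made-up notebook with one Markdown cell and two code cells.
example = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "execution_count": None, "metadata": {},
         "outputs": [], "source": ["x = 2\n", "y = x * 3"]},
        {"cell_type": "code", "execution_count": None, "metadata": {},
         "outputs": [], "source": "print(y)"},
    ],
})

script = notebook_to_script(example)
print(script)
```

The explanatory text and stored outputs are dropped, which is exactly what makes this format useful when you want to run a notebook's code outside Jupyter.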
While there’s no single source listing all of Jupyter’s extensions, widgets, and tools, there is a lively ecosystem of developers working on building features for the Jupyter platform.
JupyterLab and the future
JupyterLab is the next important change in Jupyter's universe. The JupyterLab Computational Environment rethinks Jupyter as an IDE, an integrated development environment, for working with software.
Much of what's in JupyterLab is already built into Jupyter; the JupyterLab project is really about taking features that are already baked in and exposing them so they can be used more flexibly. There's a file manager; a graphical console; a terminal window for monitoring the system; an advanced text editor; and, of course, an integrated help system. What's new is that these features are exposed in new ways: it's easier to build dashboards and to access the tooling needed to create and debug more complex applications.
The Zero to JupyterHub project makes it much easier to run JupyterHub in the cloud: specifically, Google Compute Engine and Microsoft Azure (with more to come). Running JupyterHub in the cloud means you can make notebooks accessible to a very broad audience, without worrying about computing resources. Zero to JupyterHub uses the Kubernetes, Helm, and Docker projects to manage the use of services in the cloud, and to provide standard and robust computing environments.
The Jupyter project is working toward real-time collaboration in notebooks: allowing multiple users to edit a notebook simultaneously. We’re used to dynamic collaboration on Google Docs and other online platforms. Why not Jupyter? There are extensions that allow notebooks to be hosted on Google Drive; we’re looking to see collaboration baked directly into JupyterHub, so that it’s available anywhere in workgroup and enterprise deployments.
Jupyter has become a standard for scientific research and data analysis. It packages computation and argument together, letting you build “computational narratives”; it allows you to publish your narratives in many formats, from live online notebooks to slide decks; it supports many popular programming languages; and it simplifies the problem of distributing working software to teammates and associates. There are many tools, ranging from traditional IDEs to analytics platforms, that solve one or two of these problems. Jupyter is the only one that solves them all. To succeed in digital transformation, businesses need to adopt tools that have been proven: tools that enable collaboration, sharing, and rapid deployment. That’s what Jupyter is about.