(Apologies to Ray Kurzweil for the title puns)
Recent one-day events showcased the Jupyter community in Boston and Atlanta, with another Jupyter Pop-up event coming on May 15 in Washington, D.C. At the same time, Project Jupyter has been in the news. We’re finding overlap between the themes explored at these community events and recent articles written about Jupyter. That overlap, in turn, illustrates the kinds of dialog that we’re looking forward to at JupyterCon this August.
In the news, notably there was the James Somers article, “The Scientific Paper Is Obsolete”, in The Atlantic, and a subsequent piece, “Jupyter, Mathematica, and the Future of the Research Paper”, by Paul Romer, former chief economist at the World Bank. Both articles compare and contrast Wolfram Research’s Mathematica with Project Jupyter. On the surface, the two approaches both implement notebooks, with excellent examples coming from both communities. However, Paul Romer nailed the contrast between them with a one-liner: “The tie-breaker is social, not technical. The more I learn about the open source community, the more I trust its members.”
Under the surface, the parallels end. Mathematica, which came first, is a popular commercial software product. Jupyter is an open standard for a suite of network protocols that support remote execution environments, plus a spectrum of open source software projects that build extensible environments atop those protocols, such as JupyterLab, JupyterHub, Binder, etc. Organizations leverage Jupyter as a foundation for shared data infrastructure at scale. Organizational challenges emerge along with those implementations at scale: collaboration, discovery, security, compliance, privacy, ethics, provenance, etc. Through this open, community-centered approach, we get open standards, open source implementations, and open discussions about best practices for shared concerns.
For common threads between the two, James Somers’ distillation is subtle: “Software is a dynamic medium; paper isn’t.” It’s been 27 years since the public debut of the World Wide Web, though we’re still barely scratching the surface of what that invention made possible. Frankly, an overwhelming amount of “digital paper” persists on the web. While the promise of the web implies dynamic, interactive media shared across global infrastructure, questions linger about how best to make that happen. Some of those questions have also been in the news recently.
Rolling the clock back a few decades further, one gem on my bookshelf is Computer Lib/Dream Machines, by Ted Nelson, first published in 1974. Nelson explored hypertext, which he’d been working to implement since 1963—though, arguably, that notion traces back to Vannevar Bush and Jorge Luis Borges in the 1940s. To capture the essence of hypertext, Computer Lib also introduced the concept of "intertwingularity": complex interrelations within human knowledge. Nelson’s vision had documents representing the world’s knowledge, documents which could interact and intermingle. Borges offered a poetic glimpse of this in his 1941 short story, El jardín de senderos que se bifurcan (“The Garden of Forking Paths”): the legend of Ts’ui Pên constructing an infinite labyrinth, in which all would lose their way, along with a WWII espionage drama unfolding around that legend.
Out of the many neologisms and one-liners that have attempted to describe Jupyter, intertwingularity nails it. One may “perform science” by authoring a research paper in a journal. That’s science with a lowercase “s,” on paper or something approximating it, merely navigating a single corner of Ts’ui Pên’s labyrinth. Ted Nelson’s vision, however, had documents interacting, intermingling. The practice of reproducible science, which is rapidly unfolding around Jupyter, also relies on documents interacting and intermingling. That opens the door to software as a dynamic medium, "Science" with an uppercase “S.” Not merely a library of “digital paper,” but an entirely new way of collaborating, extending our understanding. Potentially as a map through the entire labyrinth.
Reproducible science via Jupyter finds immediate applications in many places. Certainly there are the “hard sciences”: at JupyterCon, we’ll have session talks ranging across astrophysics, quantum chemistry, genomics, geospatial analysis, climatology, and scientific computing in general. During the Jupyter Day Atlanta event, one excellent example was “Classification and Characterization of Metal Powder in Additive Manufacturing using Convolutional Neural Networks,” by Anna Smith from CMU.
Beyond research, reproducible science is vital for any organization that depends on analysis—and that forms Jupyter’s direct link to data science. During the Jupyter Pop-up Boston event, Dave Stuart presented “Citizen Data Science campaign,” about an open source project called nbgallery, which thousands of DoD analysts use to discover and share Jupyter notebooks. While some teams have computational needs in common, they may not be allowed to share data. Similar data privacy concerns are encountered in finance, health care, social media, etc. The DoD project provides a fascinating approach to discovery (search, recommendations) for interactive content in highly regulated enterprise environments.
In Atlanta, two industry use cases addressed similar needs: Peter Parente from Valassis Digital with “Give a Little Bit of Your Notebooks to Me”—also about sharing and discovering notebook content across an enterprise organization—and John Patanian from General Electric with “Achieving Reproducible and Deployable Data Science Workflows,” about using templates for reproducible workflows.
Similar efforts are changing the classroom. In Boston, we had Allen Downey, Taylor Martin, and Doug Blank join the “Jupyter in Education” panel. In particular, reproducible science via Jupyter notebooks helps instructors manage the scaffolding needed to make course materials more engaging, more immediately hands-on, to give learners confidence and direct experience. Ryan Cooper from UConn presented “Flipping the classroom with Jupyter and GitHub” as a case study for this. In Atlanta, Carol Willing guided us through several excellent examples in “STEAM Workshops with Binder and JupyterHub.”
At a higher level of abstraction, reproducible science has an impact on computer science. In Boston, David Koop and Colin Brown from UMassD presented “Supporting Reproducibility in Jupyter through Dataflows.” Also, see a related project called Nodebook at Stitch Fix by Kevin Zielnicki. By default, cells in a Jupyter Notebook run from top to bottom, although a person needs to “Run All” to be sure that results are correct. The Dataflows and Nodebook projects track inputs and outputs for each cell so that notebooks can be guaranteed to “rerun” successfully. The UMassD project also allows for rearranging cell order: for example, while you may need a long list of Python imports to initialize a notebook, why not move that cell to the end, so that the initial part of the notebook can jump directly into core code? On the one hand, that supports better scaffolding. On the other hand, these projects represent Jupyter Notebooks as dependency graphs, with pre- and post-conditions for each cell. That’s only a few steps away from Petri nets and other automata used for formal analysis of computer programs, concurrency, business processes, reliability engineering, security audits, etc. An imaginable next step could be to leverage machine learning to start generating unit tests—and for code generation in general.
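To make the dependency-graph idea concrete, here is a minimal sketch in plain Python (standard library only). It is not code from the Dataflows or Nodebook projects; the cell names, inputs, and outputs are hypothetical. Each “cell” declares which variable names it reads and which it defines, and a topological sort recovers an execution order that is guaranteed to run each cell after its dependencies, regardless of how the cells appear on screen:

```python
# Sketch: treating notebook cells as a dependency graph, in the spirit
# of the Dataflows and Nodebook projects. Requires Python 3.9+ for graphlib.
from graphlib import TopologicalSorter

# Hypothetical cells: each declares the names it reads and the names it defines.
cells = {
    "imports": {"inputs": set(),         "outputs": {"pd"}},
    "load":    {"inputs": {"pd"},        "outputs": {"df"}},
    "clean":   {"inputs": {"df"},        "outputs": {"df_clean"}},
    "plot":    {"inputs": {"df_clean"},  "outputs": set()},
}

def cell_dependencies(cells):
    """Map each cell to the set of cells that produce its inputs."""
    producers = {name: cell
                 for cell, spec in cells.items()
                 for name in spec["outputs"]}
    return {cell: {producers[n] for n in spec["inputs"] if n in producers}
            for cell, spec in cells.items()}

def execution_order(cells):
    """A valid run order: every cell comes after the cells it depends on."""
    return list(TopologicalSorter(cell_dependencies(cells)).static_order())

print(execution_order(cells))
# → ['imports', 'load', 'clean', 'plot']
```

With this bookkeeping in place, the “imports” cell could sit anywhere in the visible notebook, since the graph, not the on-screen order, determines when it runs; and a cell whose inputs read a name that no cell produces is immediately detectable as a broken precondition.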
Here’s an intertwingled idea that weaves together most of the above. Generations of modern science have brought us to a point where reproducible science becomes a priority. Collaboration at a global scale can’t proceed further without it. Meanwhile, open source software, since roughly 1998, has similarly evolved to support collaboration at a global scale, leading to standard practices such as versioning (e.g., git), testing, documentation, pull requests, etc. Most of those practices support reusability. Add some DevOps, and continuous integration/continuous deployment becomes the software analogy for reproducible science.
We’re at a point where those two cultures, science and open source, have much to learn from each other. Science must learn to reuse and improve common software tools, while software must embrace reproducible science. Both are necessary for collaboration at scale. The nexus for that intermingling is Jupyter, where (and when) humans move beyond using digital mimics of print media to take better advantage of what software and collaboration promise in the long term.
Join us at Jupyter Pop-up D.C. on Tuesday, May 15, 2018, at the GWU Marvin Center, from 9:00 a.m. to 5:00 p.m. We’ll have a mix of talks from government, industry, and education about Jupyter, along with a lot of opportunities for networking. It’s a great preview for what’s to come at JupyterCon, August 21-24, 2018, in New York City.