Breaking Data Science Open by Christine Doig, Michele Chambers, Ian Stokes-Rees

Chapter 4. Open Data Science Applications: Case Studies

Open Data Science has brought the ingredients of data science—data, analytics, and computation—within everyone’s reach. This is fueling a new generation of intelligent applications that solve previously intractable problems and facilitate innovative discoveries. Here are a few case studies of Continuum Analytics’ clients that showcase the power of Open Data Science.

Recursion Pharmaceuticals

This biotech startup found that the enormous size of genomic material, and the complex interactions within it, made it hard for biologists to find relationships that might predict diseases or optimize treatment. Through a sophisticated combination of analytics and visualization, Recursion’s data scientists produced heat maps that compare diseased samples to healthy genetic material and highlight the differences. Biologists can now not only identify disease markers more accurately and quickly, but also run simulations that apply thousands of potential drug remedies to diseased cells to identify treatments.

This has greatly accelerated the treatment discovery process. Fueled by Open Data Science, Recursion Pharmaceuticals has been able to find treatments for rare genetic diseases, specifically unanticipated uses for drugs already developed by its client pharmaceutical companies. The benefits to patients are incalculable, because treatments for rare diseases seldom offer enough revenue potential to justify costly drug development. Furthermore, small patient populations mean that conventional randomized drug trials often can’t produce statistically significant results, so these drugs might otherwise not be approved for sale.


TaxBrain

The Open Source Policy Center (OSPC) was formed to “open-source the government” by creating transparency around the models used to formulate policies. Until now, those models have been locked up in proprietary software. The OSPC created an open source community seeded by academics and economists. Using Open Data Science, this community translated the private economic models that sit behind policy decisions and made them publicly available as open source software. Citizen data scientists and journalists can access these models today through the OSPC TaxBrain web interface, which allows anyone to predict the economic impact of tax policy changes.

Having represented the tax code in a calculable form, this team can now ask questions such as: what will be the result of increasing or decreasing a rate? How about a deduction? By putting their work on the web, the team allows anyone with sufficient knowledge to ask such questions and get instant results. People concerned with taxes (and who isn’t?) can immediately show the effects of a change, instead of depending on the assurances of the Treasury Department or a handful of think-tank experts. This is not only an Open Data Science project, but also an open data project (drawing from published laws) and an open source software project (the code was released on GitHub).
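To make the idea of a tax code "in a calculable form" concrete, here is a minimal Python sketch. It is not TaxBrain's code or model: the `Policy` class, the `tax_owed` function, and every bracket, rate, and deduction value are hypothetical and chosen only for illustration. It shows how the two questions above, changing a rate and changing a deduction, become small edits to a policy object that can be recomputed instantly.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Policy:
    """A toy tax policy. All figures used below are invented for illustration."""
    brackets: tuple            # (lower_bound, marginal_rate) pairs, sorted ascending
    standard_deduction: float


def tax_owed(income, policy):
    """Tax under a simple progressive schedule with a single flat deduction."""
    taxable = max(0.0, income - policy.standard_deduction)
    # Upper bound of each bracket is the lower bound of the next one.
    uppers = [lower for lower, _ in policy.brackets[1:]] + [float("inf")]
    owed = 0.0
    for (lower, rate), upper in zip(policy.brackets, uppers):
        if taxable > lower:
            # Only the slice of income inside this bracket is taxed at its rate.
            owed += (min(taxable, upper) - lower) * rate
    return owed


# Hypothetical baseline policy (not real tax law).
baseline = Policy(
    brackets=((0, 0.10), (20_000, 0.20), (80_000, 0.30)),
    standard_deduction=10_000,
)

# "What will be the result of increasing a rate?"
higher_top_rate = replace(
    baseline, brackets=((0, 0.10), (20_000, 0.20), (80_000, 0.35))
)

# "How about a deduction?"
bigger_deduction = replace(baseline, standard_deduction=15_000)

if __name__ == "__main__":
    income = 120_000
    scenarios = [
        ("baseline", baseline),
        ("higher top rate", higher_top_rate),
        ("bigger deduction", bigger_deduction),
    ]
    for name, policy in scenarios:
        print(f"{name}: {tax_owed(income, policy):,.2f}")
```

A real policy model also has to account for filing status, credits, phase-outs, and behavioral responses, and it runs over a population of tax records rather than a single income. The sketch only illustrates why an executable representation lets anyone rerun the numbers for a proposed change.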

TaxBrain is a powerful departure from the typical data science project, in which a team of data scientists creates models that are surfaced to end users via reports. Instead, TaxBrain was developed by subject matter experts who easily picked up Python and created powerful economic models that simulate the complexities of the US tax code and predict future policy outcomes through an interactive visual interface.

Lawrence Berkeley National Laboratory/University of Hamburg

In academia, scientists often collaborate on their research, and this is true of the physicists at the University of Hamburg. Like many scientists today, they also act as data scientists. Their research is quantified with data, and the reproducibility of their results is important for effective dissemination.

Competition for time on one of the world’s most advanced plasma accelerators is intense. The University of Hamburg group must show that its research is innovative and that its time on the accelerator will produce novel results that push the frontiers of scientific knowledge.

To this end, particle physicists from Lawrence Berkeley National Laboratory (LBNL) and the University of Hamburg worked together to create a new algorithm and approach, using cylindrical geometry, which they embedded in a simulator to identify the best experiments to run on the plasma accelerator. Even though the scientists are on separate continents, they were able to collaborate easily using Open Data Science tools. This boosted their development productivity and allowed them to scale out complex simulations across a 128-GPU cluster, which resulted in a 50 percent speedup. This cutting-edge simulation optimized their time on the plasma accelerator, allowing them to zero in quickly on the most innovative research.

As more businesses and researchers try to rapidly unlock the value of their data in modern architectures, Open Data Science becomes essential to their strategy.
