Chapter 7. The Open Data Science Landscape

Now that you have a sense of what Open Data Science is, and how to prepare for it culturally and organizationally, it’s time to talk about the tools and technologies available in the Open Data Science world.

Open Data Science tools have grown to match and outperform traditional commercial tools for statistics and predictive analytics. In fact, they are rapidly becoming the tools of choice for data scientists, with Python and R as the primary languages. They are part of a rich open source ecosystem whose resources go well beyond the capabilities offered by commercial software. In the Big Data space, Open Data Science technologies such as Hadoop, Spark, MongoDB, and Elasticsearch are frequently preferred over commercial alternatives, not simply because of the price differential, but because they offer the most powerful and capable enterprise-ready technology available today for the problems they address.

Because of the scale and self-managed/anarchical structure of open source communities, the Open Data Science community can also seem chaotic from the outside. That’s why many organizations have adopted open source distributions backed by companies that provide compatibility and enterprise guarantees. You need to bring order into that chaos, so you can leverage the diverse array of languages, packages, and tools within your company, and have them work in your ...
