Python data tools just keep getting better
A variety of tools are making data science tasks easy to do in Python
Here are a few observations inspired by conversations I had during the just-concluded PyData conference.
The Python data community is well-organized:
Besides conferences (PyData, SciPy, EuroSciPy), there is a new non-profit (NumFOCUS) dedicated to supporting scientific computing and data analytics projects. The supported projects are currently all Python-based, but in principle NumFOCUS is an entity that can be used to support related efforts from other communities.
It’s getting easier to use the Python data stack:
There are tools that facilitate the dissemination and sharing of code and programming environments. IPython notebooks allow Python code and markup in the same document. Notebooks are used to record and share complex workflows and are used heavily for (conference) tutorials. As the data stack grows, one of the major pain points is getting all the packages to work properly together: version compatibility is a common issue, and setting up an environment where all the pieces cooperate can be painful. There are now a few solutions that address this issue: Anaconda and the cloud-based Wakari from Continuum Analytics, and the cloud computing platform PiCloud.
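To make the version-compatibility problem concrete, here is a minimal sketch that reports which versions of a few packages are installed in the current environment. It uses only the standard library's `importlib.metadata` (Python 3.8 or later); the helper name `installed_versions` and the package list are illustrative assumptions, and this is not how Anaconda, Wakari, or PiCloud work internally.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent.

    Hypothetical helper for illustration only.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            versions[name] = None
    return versions


# Example package list (an assumption; substitute your own stack).
report = installed_versions(["numpy", "pandas", "matplotlib"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Saving a report like this alongside a notebook gives collaborators a quick way to spot mismatched versions before they try to rerun a workflow.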
There are many more visualization tools to choose from:
The 2D plotting tool matplotlib is the first tool enthusiasts turn to, but as I learned at the conference, there are a number of other options available. Continuum Analytics recently introduced companion packages Bokeh and Bokeh.js that simplify the creation of static and interactive visualizations using Python. In particular, Bokeh is positioned as a Python equivalent of ggplot (it even has an interface that mimics ggplot). With Nodebox, programmers use Python code to create sketches and interactive visualizations that are similar to those produced by Processing.
Large-scale data processing and wrangling tools have improved:
Pandas and PyTables are already popular, and there was very strong interest in the forthcoming Blaze project at the conference. Other options include the Disco Project, a data processing platform that includes an implementation of Map/Reduce, and PySpark, the Python API for the Spark data analytics framework.
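For readers new to the Map/Reduce model mentioned above, the following is a minimal single-process sketch in plain Python. It does not use the Disco or PySpark APIs; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, and a real framework would distribute each phase across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["python data tools", "data tools for Python"])
print(counts)
```

Because the map and reduce steps only ever see one line or one key at a time, each can run independently on a different worker, which is the property that lets platforms of this kind scale out.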
There are viable tools for large-scale data analytics:
Scikit-learn (machine-learning library) and scikit-image (image processing) are used by many academic research groups and companies. Both have extensive libraries of algorithms, and come with lots of examples to help users get started. Another tool written in Python focuses on deployment: Augustus is an open source system for building and scoring scalable data mining and statistical algorithms. Augustus produces and consumes PMML, and includes components for simple data wrangling (users can embed Python code for data processing in their PMML files).
In addition, new tools like H2O and wise.io plan to make their massively scalable algorithms accessible via Python. Frameworks that expose distributed algorithms to Python programmers include GraphLab (Python/Jython interface) and Spark (algorithms in Scala that are accessed via PySpark). Finally, there are also tools that let Python programmers target GPUs for parallel programming: NumbaPro and PyCUDA.