Julia’s Role in Data Science

Myths and Realities

By John White

October 23, 2013

White polka dots on black cotton fabric (source: Wikimedia Commons)

Introduction

Since its first public release in February 2012, the Julia programming language has received a lot of hype. This has led to some confusion about the language’s current status. In this post, I’d like to make clear where Julia stands and where Julia is going, especially in regard to Julia’s role in data science, where the dominant languages are R and Python. We’re working hard to make Julia a viable alternative to those languages, but it’s important to separate out myth from reality.

Where Julia Stands

To dispel some of the confusion about Julia, I want to discuss the two main types of misunderstanding that I come across:

Confusion 1: Julia already possesses a mature package ecosystem and can be used as a feature-complete replacement for R or Python.
Confusion 2: Julia’s compiler is so good that it will make any piece of code fast – even bad code.

The truth about Julia is closer to the following:

Reality 1: Julia has a quickly growing, but still very young, package ecosystem. If you want to be productive, Julia needs to be part of a multilanguage environment in which you use R or Python when they are more appropriate. How much of your work can be done using only Julia depends upon your specific needs. People who tend to construct novel models and fit them using optimization algorithms will find that Julia is already nearly feature-complete. People who depend upon R’s large collection of classical statistical procedures will find that Julia is still missing a lot of functionality.
Reality 2: Julia’s compiler can produce code nearly as efficient as similar C code if the Julia code given to the compiler is written with performance in mind. What sets Julia apart is not a sufficiently smart compiler that works around sloppy code, but rather a combination of (1) a strong type system that aligns naturally with primitive machine operations and (2) a fully automatic type inference system, which lets the compiler do the tedious work of type declaration when users do not want to do it themselves.

Julia’s Ecosystem is Growing, but Young

Julia was publicly released ~1.5 years ago, after 2 years of internal development. Although the Julia community has grown substantially since February 2012, the language ecosystem is still very young. Many of the popular libraries available for languages like R or Python have no parallels in Julia yet. Julia is slowly developing its own ecosystem of packages, but any practical data scientist will need to mix Julia code with R or Python when the problem at hand demands it.

That said, it’s worth noting that the Julia ecosystem already has some very impressive packages:

Base Julia provides much of the functionality available in NumPy.
Additional Julia packages are slowly filling in the functionality of SciPy, including Stats.jl, Distributions.jl, Optim.jl and JuMP.jl.
DataFrames.jl provides tools for working with tabular data that will be familiar to users of R or pandas.
Gadfly.jl provides a bare-bones visualization package similar in spirit to ggplot2, while PyPlot.jl provides a complete interface to matplotlib from Julia.
Graphs.jl provides some of the functionality from packages like igraph or NetworkX.

When these packages do not meet users’ needs, most Julia users will get their work done using one of two strategies:

Direct language interop: PyCall.jl makes it possible to call Python code from inside of a Julia program. The Julia community is already using these interop facilities to build packages like SymPy.jl, which wraps a popular symbolic algebra system developed for Python. Similarly, Matlab.jl makes it possible to call Matlab from Julia.
Multistep pipelines: Many data science tasks can be divided into a pipeline of completely independent steps. Newcomers to Julia can transition a pipeline over to Julia in steps, which eases the transition. When I first started using Julia, I would frequently do data preprocessing and modeling in Julia, but all of the subsequent visualization steps in R. As Julia’s package ecosystem matures, more parts of a pipeline can be translated into Julia code.
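
As a rough illustration of the first strategy, the sketch below calls Python's standard math module from Julia through PyCall.jl. It assumes that PyCall.jl is installed and uses the pyimport function from current PyCall releases, which may differ from the syntax that was available when this post was written. The variable names are chosen only for illustration.

```julia
using PyCall  # assumption: the PyCall.jl package is installed

# Import Python's standard-library math module as a Julia object.
pymath = pyimport("math")

# Call a Python function from Julia; PyCall converts the
# returned Python float into a Julia Float64.
y = pymath.sin(pymath.pi / 4)

# Compare against the same quantity computed natively in Julia.
println(y ≈ sqrt(2) / 2)
```

The same pattern applies to larger Python libraries: import the module once, then call into it wherever Julia does not yet have an equivalent.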

Julia’s Compiler isn’t Magic: The Language Design Is

Julia has acquired a well-deserved reputation for speed. Microbenchmarks demonstrate that well-written Julia performs nearly as well as similar C code. But unlike a language such as JavaScript, Julia achieves its high level of performance through the systematic use of machine-appropriate types and data structures, rather than through a compiler with the sophistication of JavaScript’s V8 engine.

To see what makes Julia’s approach special, consider a one-line method definition in Julia: double(x) = x + x. In Julia, this method definition actually defines a potentially infinite family of functions: one for each of the possible types of input that might be passed as an argument to the function. For example, double(2) will call a specialized function definition that uses a CPU’s native integer addition instruction, whereas double(2.0) will use the CPU’s native floating-point addition instruction. Julia’s ability to generate specialized code for different input types, when coupled with the compiler’s ability to infer these types for most variables, makes it possible to write Julia code at a very abstract level while achieving the efficiency associated with low-level code that would work on only a small subset of machine primitives. Julia’s ability to compile code that reads like Python into machine code that performs like C almost entirely derives from Julia’s ability to specialize function definitions in this way.
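
The double example can be tried directly. The following is a minimal sketch, assuming a current Julia 1.x release; the third call, with a rational number, is an extra illustration that is not part of the discussion above.

```julia
# One generic definition; Julia compiles a separate specialization
# for each concrete argument type the function is called with.
double(x) = x + x

a = double(2)       # Int argument: integer arithmetic
b = double(2.0)     # Float64 argument: floating-point arithmetic
c = double(1 // 2)  # Rational{Int} argument, same source definition

# The result type follows from the argument type in each case.
println(typeof(a))
println(typeof(b))
println(typeof(c))
```

In the REPL, @code_native double(2) and @code_native double(2.0) display the machine code generated for each specialization, which is one way to check what the compiler actually produced.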

While Julia’s compiler is able to exploit type inference to generate very efficient code in many cases, it’s important to keep in mind that Julia’s compiler isn’t doing anything magical. Code that can be interpreted in terms of simple operations on basic machine types will be as fast as careful C code, but code that doesn’t let the compiler do its tricks won’t be faster than code written in languages like R or Python.
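
One common way to end up with code that does not let the compiler do its tricks is to store values in a container whose element type is unknown. The sketch below is a hypothetical illustration, assuming a current Julia 1.x release: the same loop runs over a Vector{Float64} and over a Vector{Any} holding identical values. The function name sum_loop is invented for this example.

```julia
# A plain accumulation loop with no type annotations.
function sum_loop(xs)
    s = 0.0
    for x in xs
        s += x
    end
    return s
end

typed = collect(1.0:1000.0)       # Vector{Float64}: element type is known
untyped = Any[x for x in typed]   # Vector{Any}: same values, type unknown

# With `typed`, the compiler can infer that every `x` is a Float64 and
# emit a tight floating-point loop. With `untyped`, each `s += x` has
# to be dispatched at run time because `x` could be anything.
println(sum_loop(typed) == sum_loop(untyped))
```

Both calls return the same value. Timing them after a warm-up call, or inspecting them with @code_warntype, should make the difference in how much the compiler was able to infer visible.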

In addition to the fact that Julia’s compiler can’t make arbitrary code fast, it’s important to keep in mind that many of the built-in functions in R or Python aren’t written in those languages, but in C. Because Julia performs roughly as well as C, this means that Julia won’t do better than R or Python if most of the work you do in R or Python is calling built-in functions without performing any explicit iteration or recursion. It’s only when you start doing custom work that Julia will really shine.

In other words, Julia is the perfect language for advanced users of R or Python who are trying to build advanced tools inside those languages. The alternative to Julia is typically resorting to C or C++: R offers this through Rcpp and Python offers it through Cython. The goal of Julia is to make it possible to get Cython-like performance in the same language you use to build your prototype.

Where Julia is Going

In the next year, we’ll be working to push Julia forward in several different directions. First and foremost, we’ll be trying to improve on Julia’s graphical toolkits, so that binary installations of Julia ship with a high quality set of graphical functions that users can use to visualize data. We expect that Julia will be able to rival the toolkits from more established languages within another year or two.

We’ll also be working to make the integration between Julia and Python much tighter, which should make it much easier for advanced Python users to rewrite performance bottlenecks in Julia, much as they currently might use Cython.

Finally, we’ll improve the quality of our data infrastructure and modeling tools. More and more of the statistical functionality from R will be ported to Julia. At the same time, interfaces to Python libraries like scikit-learn will grow. Eventually newcomers will be able to expect that most data science tasks can be done in Julia as easily as they can now be done in Python or R.

In addition to this work, the basic Julia language will continue its gradual evolution, including the introduction of better tools for parallel processing and the development of a static compiler that will generate machine executables.

Although Julia’s core language is still evolving, the basic language design has been stable for several years now, unlike the still-changing package ecosystem. Users who are interested in experimenting with the language should find that it is already able to handle many standard tasks. We hope you take Julia out for a spin and find it as enjoyable to work with as we do.
