Julia’s Role in Data Science
Myths and Realities
Since its first public release in February 2012, the Julia programming language has received a lot of hype. This has led to some confusion about the language’s current status. In this post, I’d like to make clear where Julia stands and where Julia is going, especially in regard to Julia’s role in data science, where the dominant languages are R and Python. We’re working hard to make Julia a viable alternative to those languages, but it’s important to separate out myth from reality.
To dispel some of the confusion about Julia, I want to discuss the two main types of misunderstandings that I come across:
The truth about Julia is closer to the following:
Julia’s Ecosystem is Growing, but Young
Julia was publicly released ~1.5 years ago, after 2 years of internal development. Although the Julia community has grown substantially since February 2012, the language ecosystem is still very young. Many of the popular libraries available for languages like R or Python have no parallels in Julia yet. Julia is slowly developing its own ecosystem of packages, but any practical data scientist will need to mix Julia code with R or Python when the problem at hand demands it.
That said, it’s worth noting that the Julia ecosystem already has some very impressive packages:
When these packages do not meet users’ needs, most Julia users will get their work done using one of two strategies:
Julia’s Compiler isn’t Magic: The Language Design Is
To see what makes Julia’s approach special, consider a method definition in Julia like the simple line, double(x) = x + x. In Julia, this method definition actually defines a potentially infinite family of functions: one for each of the possible types of input that might be passed as an argument to the function. For example, double(2) will call a specialized function definition that uses a CPU’s native integer addition instruction, whereas double(2.0) will use the CPU’s native floating point addition instruction. Julia’s ability to generate specialized code for different input types, when coupled with the compiler’s ability to infer these types for most variables, makes it possible to write Julia code at a very abstract level while achieving the efficiency associated with low level code that would work on only a small subset of machine primitives. Julia’s ability to compile code that reads like Python into machine code that performs like C almost entirely derives from Julia’s ability to specialize function definitions in this way.
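The specialization described above can be tried directly. The following is a minimal sketch, assuming a recent Julia 1.x release; the rational-number call is an extra illustration that the text does not mention, and the printed type names depend on the platform.

```julia
# The one-line generic method from the text.
double(x) = x + x

# One source definition serves many argument types. Julia compiles a
# separate specialization for each concrete type it is called with.
a = double(2)       # integer argument
b = double(2.0)     # floating point argument
c = double(1 // 3)  # rational argument (illustrative extra case)

println(a, " :: ", typeof(a))
println(b, " :: ", typeof(b))
println(c, " :: ", typeof(c))

# In an interactive session, the generated code for each specialization
# can be inspected (output varies by Julia version and CPU):
#   using InteractiveUtils
#   @code_llvm double(2)
#   @code_llvm double(2.0)
```

Comparing the two `@code_llvm` listings should show an integer add in one and a floating point add in the other, even though both come from the same line of source.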
While Julia’s compiler is able to exploit type inference to generate very efficient code in many cases, it’s important to keep in mind that Julia’s compiler isn’t doing anything magical. Code that can be interpreted in terms of simple operations on basic machine types will be as fast as careful C code, but code that doesn’t let the compiler do its tricks won’t be faster than code written in languages like R or Python.
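One common way code ends up not letting the compiler do its tricks is by hiding element types from it. The sketch below is an illustration under assumptions, not a benchmark: the function name total and both arrays are hypothetical, and no timings are claimed.

```julia
# A generic summation loop. Whether it compiles to tight machine code
# depends on what the compiler can infer about the elements of xs.
function total(xs)
    s = 0.0
    for x in xs
        s += x
    end
    return s
end

# The element type is known to be Float64, so the loop can be compiled
# down to plain floating point additions.
typed = [1.0, 2.0, 3.0]

# The element type is Any, so each addition has to be dispatched at run
# time, much as it would be in an interpreted language.
untyped = Any[1.0, 2.0, 3.0]

println(total(typed))
println(total(untyped))

# In an interactive session, the inferred types can be compared with:
#   using InteractiveUtils
#   @code_warntype total(typed)
#   @code_warntype total(untyped)
```

Both calls return the same value; the difference lies only in how much the compiler knows ahead of time, which is what determines the speed of the generated code.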
In addition to the fact that Julia’s compiler can’t make arbitrary code fast, it’s important to keep in mind that many of the built-in functions in R or Python aren’t written in those languages, but in C. Because Julia performs roughly as well as C, this means that Julia won’t do better than R or Python if most of the work you do in R or Python is calling built-in functions without performing any explicit iteration or recursion. It’s only when you start doing custom work that Julia will really shine.
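As an example of the kind of custom work meant here, consider a computation in which each output depends on the previous one, so it cannot simply be handed off to a single built-in vectorized call. The sketch below is a hypothetical illustration: the function ewma (an exponentially weighted moving average) and its parameter alpha are names chosen for this example.

```julia
# Exponentially weighted moving average written as an explicit loop.
# Each output depends on the previous output, so the work is inherently
# sequential.
function ewma(xs::Vector{Float64}, alpha::Float64)
    out = Vector{Float64}(undef, length(xs))
    isempty(xs) && return out
    out[1] = xs[1]
    for i in 2:length(xs)
        out[i] = alpha * xs[i] + (1 - alpha) * out[i - 1]
    end
    return out
end

smoothed = ewma([1.0, 2.0, 3.0], 0.5)
println(smoothed)
```

Because the argument types are concrete, the loop body reduces to a few floating point operations per element, which is the situation in which a hand-written Julia loop can be expected to approach the speed of C.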
In other words, Julia is the perfect language for advanced users of R or Python who are trying to build sophisticated tools inside of those languages. The alternative to Julia is typically resorting to C or C++: R offers this through Rcpp and Python offers it through Cython. The goal of Julia is to make it possible to get Cython-like performance in the same language you use to build your prototype.
In the next year, we’ll be working to push Julia forward in several different directions. First and foremost, we’ll be trying to improve on Julia’s graphical toolkits, so that binary installations of Julia ship with a high quality set of graphical functions that users can use to visualize data. We expect that Julia will be able to rival the toolkits from more established languages within another year or two.
We’ll also be working to make the integration between Julia and Python much tighter, which should make it easier for advanced Python users to reimplement performance bottlenecks in Julia, much as they currently might use Cython.
Finally, we’ll improve the quality of our data infrastructure and modeling tools. More and more of the statistical functionality from R will be ported to Julia. At the same time, interfaces to Python libraries like scikit-learn will grow. Eventually newcomers will be able to expect that most data science tasks can be done in Julia as easily as they can now be done in Python or R.
In addition to this work, the basic Julia language will continue its gradual evolution, including the introduction of better tools for parallel processing and the development of a static compiler that will generate machine executables.
Although Julia’s core language is still evolving, it’s worth noting that, unlike the package system, the basic language design has been stable for several years now. Users who are interested in experimenting with the language should find that it is ready to handle many standard tasks. We hope you take Julia out for a spin and find it as enjoyable to work with as we do.