AI & ML Business Data Innovation Research Security

Try the O’Reilly learning platform

With the O’Reilly learning platform, you get the resources and guidance to keep your skills sharp and stay ahead. Try it free for up to 14 days.

Start trial

Try a course for free

Join a live online event on the O’Reilly platform to learn from the experts shaping tech.

See what’s coming soon

Get the Radar Trends newsletter

Your email

Country

Please read our privacy policy.

Content > Topics

Fast track Apache Spark

6 lessons learned to get a quick start on productivity.

By Tobi Bosede September 12, 2017 • 6 minute read

LinkedIn X Facebook Threads Bluesky Reddit

Wrong Way Go Back (source: Johnny Jet on Flickr)

My upcoming Strata Data NYC 2017 talk about big data analysis of futures trades is based on research done under the limited funding conditions of academia. This meant that I did not have an infrastructure team, therefore I had to set up a Spark environment myself. I was analyzing futures order books from the Chicago Mercantile Exchange (CME) spanning May 2, 2016, to November 18, 2016. The CME data included extended hours trading with the following fields: instrument name, maturity, date, time stamp, price, and quantity. Futures were comprised of 21 financial instruments spanning six markets—foreign exchange, metal, energy, index, bond, and agriculture. Trades were recorded roughly every half second. In the process of doing this research, I learned a lot of lessons. I want to help you avoid making the mistakes I did so you can start making an immediate impact in your organization with Spark. Here are the six lessons I learned:

You don’t need a database or data warehouse. It is common for Spark setups to use Apache Hadoop’s distributed file system (HDFS) and Hive for querying, but you can use text files and other accepted file formats in local directories if you don’t want to go through the hassle of setting up a database or warehouse. When I worked at Sprint, they had an on-premise cluster setup and stored data in HDFS, which provided me great exposure to learn new technologies. However, in my research with the CME, I had already been reading CSVs directly into Spark using the spark-csv package, which has since been merged into the main Spark project because of the fundamental capability the package provides.
You don’t need a cluster of machines. You can hit the ground running using your local machine or a single server. This is also helpful in that you do not have to consider what cluster manager to install—YARN or Mesos. You can simply use the standalone cluster manager that comes with Spark. Just make sure that, if you use one machine, it has multiple cores and enough memory to cache your data. In general, any time you build a distributed system you should first start with one machine. One of my mentors once told me that software engineering is like math. To build something, it is often useful to start at n=0, as in an inductive proof, and then generalize from there. That is one analogy that I can relate to!
Use a notebook. Don’t bother trying to configure an IDE or using the shell to write applications. Of course, using the shell is great if you are submitting an application or doing some basic coding. I used the shell when I was learning Spark to run some of the examples that come with Spark and follow along with tutorials. And, in theory, having all the features of an IDE might be a reflexive thing many programmers want for any coding they do. However, the pain of configuring an IDE with Spark is not worth the investment. A friend of mine once told me he had spent time struggling to set up an IDE for programming in Spark in vain. If only I had met him before then. I would have been able to share that all the programming I did in Spark for my research was either in the shell or a Jupyter Notebook. The latter being used the majority of the time, so an IDE was completely unnecessary. Although, every once in awhile, I would go old school and just use vi, a command line editor, to code.
Don’t know Scala? Start learning Spark in the language you do know – whether it be Java, Python, or R. In Spark versions 2.0+, a lot of additional support was added for R, namely in the form of SparkR and sparklyr. A lot of people are intimidated by the Spark learning curve, but language does not have to be a barrier to entry. One of the Spark core developers’ mantras is to democratize data science so it is not just for the engineer, but for the scientist and the analyst. Making APIs available in so many languages is a deliberate effort on this front.
Use DataFrames instead of resilient distributed data sets (RDDs) for ease of use. There has been extensive work done between Spark versions 1.6 and 2.0 to make functionality that has traditionally been done using RDDs easier to do with DataFrames. Also, DataFrames are efficient across languages. So, if you’re more comfortable with Java, Python, or R, there is no performance loss suffered by not switching to Scala. (This reinforces lesson #4 above.) I am on the Spark users’ mailing list and have suggested various improvements to the community. In general, I try to keep abreast of latest developments in the Spark community not only through the mailing list, but also by reading online documentation and published research papers.
Avoid partial actions. A partial action is an action that does not allow the directed acyclic graph (DAG) for a Spark job to be fully evaluated. Thus, transformations along the way do not complete. For example, one issue that plagued me for some time was why computation was slow on data I cached. Eventually, I realized by looking at the web UI that only some of the data was cached. However, I still did not know why all the data was not being cached. In fact, I knew that since cache() was a transformation, I had to call an action before it would be evaluated. The code I was running was data=df.cache().show(). However, because show() is defaulted to return 20 rows, and my data had more than 20 rows, show() was a partial action. I should have been running data=df.cache().count() to cache the entire data set, as count() will always act as a complete action. So, be wary of partial actions. Thanks to Holden Karau for helping me troubleshoot.

These tips highlight Spark’s ability to deliver serious gains in productivity despite limited user computing capability. There is definitely an ideal Spark setup for each organization’s particular needs. One or all of the following will most likely be necessary once there is buy-in from stakeholders to create a robust analytics system: expanding to a cluster setup, building a data warehouse, and utilizing an infrastructure team. My hope is that this post has given you some tips to make it easier to create a proof-of-concept with Spark that justifies stakeholder investment, and that it has provided some pointers if you decide that a bare bones Spark setup is best for you.

Post topics: Data science

Cloud Computing

Data Engineering

Data Science

AI & ML

Programming Languages

Software Architecture

IT/Ops

Security

Design

Business

Soft Skills

Try the O’Reilly learning platform

Try a course for free

Get the Radar Trends newsletter

Thank you for subscribing to the O’Reilly Radar Trends to Watch newsletter.

Fast track Apache Spark