Chapter 4. Using Spark Streaming to Manage Sensor Data

Editor’s Note: At Strata + Hadoop World in New York, in September 2015, Hari Shreedharan (Software Engineer at Cloudera) and Anand Iyer (Senior Product Manager at Cloudera) presented this talk, which applies Spark Streaming architecture to IoT use cases, demonstrating how you can manage large volumes of sensor data.

Spark Streaming represents a continuous stream of data as an abstraction called a discretized stream, commonly referred to as a DStream. A DStream breaks the continuous stream into disjoint chunks called microbatches. The data that falls within a microbatch (essentially the data that streamed in during that microbatch's time slot) is converted to a resilient distributed dataset (RDD), and Spark then processes that RDD with regular RDD operations.
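As a minimal sketch of that model (in Scala, with a placeholder socket source and an illustrative two-second batch interval), the snippet below creates a streaming context; every two seconds, the data received in that slot becomes one RDD, and the map and print calls run on each of those RDDs.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object MicrobatchSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("SensorStream")
        // Every 2 seconds, the data received in that slot becomes one microbatch (one RDD).
        val ssc = new StreamingContext(conf, Seconds(2))

        // Placeholder source; in the railway use case the source is Kafka (see below).
        val lines = ssc.socketTextStream("localhost", 9999)

        // Regular RDD operations, applied to every microbatch.
        lines.map(_.toUpperCase).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }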

Spark Streaming has seen tremendous adoption over the past year, and is now used for a wide variety of use cases. Here, we’ll focus on the application of Spark Streaming to a specific use case—proactive maintenance and accident prevention in railways.

To begin, let’s keep in mind that the IoT is all about sensors: sensors that are continuously producing data, with all of that data streaming into your data center. In our use case, we fitted sensors to railway locomotives and carriages. We wanted to detect two different problems from the sensor data: (a) damage to the axles or wheels of the locomotives or carriages; and (b) damage to the rail tracks.

The primary goal in our work was to prevent derailments, which result in the loss of both lives and property. Though railway travel is one of the safest forms of travel, any loss of life and property that can be prevented should be.

Another goal was to lower costs. If you can identify issues early, then you can fix them early; and in almost all cases, fixing issues early costs you less.

The sensors placed on the railway carriages are continuously sending data, and there is a unique ID that represents each sensor. There’s also a unique ID that represents each locomotive. We want to know how fast the train is going and what the temperature is, because invariably, if something goes wrong, the metal heats up. In addition, we want to measure pressure, because when there’s a problem, there may be excessive weight on the locomotive or some other form of pressure that’s preventing the smooth rotation of the wheels.

The sound of the regular hum of an engine or the regular rhythmic spinning of metal wheels on metal tracks is very different from the sound that’s produced when something goes wrong—that’s why acoustic signals are also useful. Additionally, GPS coordinates are necessary so that we know where the trains are located as the signals stream in. Last, we want a timestamp to know when all of these measurements are taken. As we capture all of this data, we’re able to monitor the readings to see when they increase from the baseline and get progressively worse—that’s how we know if there is damage to the axle or wheels.
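Taken together, these fields suggest a record shape like the following sketch (a hypothetical Scala case class; the field names, units, and types are assumptions rather than the schema used in the original system).

    case class SensorReading(
      sensorId: String,      // unique ID of the sensor
      locomotiveId: String,  // unique ID of the locomotive
      speedKmh: Double,      // how fast the train is going
      temperatureC: Double,  // metal heats up when something goes wrong
      pressure: Double,      // excess weight or friction shows up here
      acoustic: Double,      // amplitude of the acoustic signal
      latitude: Double,      // GPS coordinates of the reading
      longitude: Double,
      timestamp: Long        // when the measurement was taken (epoch milliseconds)
    )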

Now, what about damage to the rail tracks? Damage on a railway track occurs at a specific location. With railway tracks you have a left and a right rail, and damage is likely on one side, not both. When a wheel goes over a damaged area, the sensor associated with that wheel will see a spike in readings. The readings that spike are likely to be acoustic noise (the metal clanging sound) and pressure. Temperature may not come into play as much, because there probably needs to be a sustained period of damage in order to affect that reading. So in the case of a damaged track, acoustic noise and pressure readings are likely to go up, but only as a spike: the minute the wheel passes the damaged area, the readings come back down. That’s our cue for damage on a railway track.
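To make the distinction concrete, here is a deliberately simple, hypothetical sketch of the two signatures described above: a short-lived spike suggests track damage, while a sustained rise suggests wheel or axle damage. The thresholds and proportions are illustrative assumptions only.

    object DamageHeuristics {
      // readings: time-ordered acoustic or pressure values for one sensor
      // baseline: the sensor's normal operating level
      def classify(readings: Seq[Double], baseline: Double, factor: Double = 2.0): String = {
        val elevated = readings.count(_ > baseline * factor)
        if (readings.isEmpty || elevated == 0) "normal"
        else if (elevated.toDouble / readings.size < 0.2) "spike: possible track damage"
        else "sustained rise: possible wheel or axle damage"
      }
    }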

Architectural Considerations

In our example, all of these sensor readings have to go from the locomotive to the data center. The first thing we do when the data arrives is write it to a reliable, high-throughput streaming channel, or streaming transportation layer—in this case, we use Kafka. With the data in Kafka, we can read it in Spark Streaming, using the direct Kafka connector.
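A minimal sketch of that ingest path, assuming Scala and the direct Kafka connector that shipped with Spark Streaming at the time (the spark-streaming-kafka 0.8 direct API); the broker addresses and topic name are placeholders.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.kafka.KafkaUtils
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("SensorIngest")
    val ssc  = new StreamingContext(conf, Seconds(2))

    // Placeholder broker list and topic name.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics      = Set("sensor-readings")

    // Each record is a (key, value) pair; the value is the raw sensor event.
    val rawEvents = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    val events = rawEvents.map { case (_, value) => value }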

Once these events reach the data center, the first processing step is to enrich them with relevant metadata, to help determine whether there is potential damage. Based on the locomotive ID, for example, we want to fetch information about the locomotive, such as its type: is it a freight train, is it carrying human passengers, how heavy is it, and so on. And if it is a freight train, is it carrying hazardous chemicals? If so, we would probably need to take action at any hint of damage. If it’s a freight train coming back empty, with no cargo, it’s likely to be less critical. For these reasons, information about the locomotive is critical.

Similarly, information about each sensor is critical. You want to know where the sensor is on the train (i.e., is it on the left wheel or the right wheel?). GPS information is also important because if the train happens to be traveling on a steep incline, you might expect temperature readings to go up. The Spark-HBase module, which is now a part of the HBase code base, is what we recommend for pulling in this data.
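As a sketch of that enrichment step, the snippet below looks up locomotive metadata by ID inside mapPartitions using the plain HBase client API (rather than the Spark-HBase module recommended above); the table name, column family, and the parseLocomotiveId helper are hypothetical.

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
    import org.apache.hadoop.hbase.util.Bytes

    // Enrich each event with locomotive metadata keyed by locomotive ID.
    val enriched = events.mapPartitions { partition =>
      val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
      val table      = connection.getTable(TableName.valueOf("locomotives"))   // placeholder table

      val out = partition.map { event =>
        val locomotiveId = parseLocomotiveId(event)   // hypothetical parser for the raw event
        val result   = table.get(new Get(Bytes.toBytes(locomotiveId)))
        val locoType = Bytes.toString(result.getValue(Bytes.toBytes("meta"), Bytes.toBytes("type")))
        (event, locoType)
      }.toList        // materialize before closing the connection

      table.close()
      connection.close()
      out.iterator
    }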

After you’ve enriched these events with all the relevant metadata, the next task in our example is to determine whether a signal indicates damage, either through simple rules or a predictive model. Once you’ve identified a potential problem, you write an event to a Kafka queue. You’ll have an application that’s continuously listening for alerts on that queue, and when it sees an event, the application sends out a physical alert (e.g., a pager alert, an email alert, or a phone call) notifying a technician that something’s wrong.
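A sketch of that alerting write path, assuming a hypothetical suspectedDamage DStream of alert strings produced by the rules or model above; the topic name and broker list are placeholders.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    suspectedDamage.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        val props = new Properties()
        props.put("bootstrap.servers", "broker1:9092,broker2:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)

        // The downstream alerting application listens on this topic and pages a technician.
        partition.foreach { alert =>
          producer.send(new ProducerRecord[String, String]("damage-alerts", alert))
        }
        producer.close()
      }
    }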

One practical concern here is data storage: it’s helpful to dump all of the raw data into HDFS, for two reasons. First, keeping the raw data allows data scientists to play with it and possibly uncover new insights. Second, there will likely be bugs in your application code, and you’ll want to audit what happened when things go wrong. Having the raw data in HDFS lets you write simple batch jobs to figure out what went wrong, whether in your application logic or, in certain cases, in the sensors themselves.
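Persisting the raw stream can be a one-liner; the sketch below assumes the rawEvents stream from the ingest sketch above and a placeholder HDFS path. saveAsTextFiles writes one directory per microbatch, named with the prefix plus the batch time.

    rawEvents
      .map { case (_, value) => value }
      .saveAsTextFiles("hdfs:///data/raw/sensor-readings", "txt")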

Visualizing Time-Series Data

Once a technician knows that there’s a potential problem, it’s time to diagnose the issue. To do that, the technician will have to look at readings from the sensors as time-series data, over different windows of time. Being able to visualize when readings occurred is enormously helpful; Grafana is one open source tool for doing this, and you can always build something quickly using JavaScript. Once the technician has diagnosed the issue, depending on what the problem is, they can either have the train sent for regular maintenance or have it stopped immediately if it is carrying passengers or hazardous chemicals.

The Importance of Sliding Windows

Sliding windows are critical in Spark Streaming. You always want to specify the time period over which you want to apply your operation. Rather than writing custom code that inspects each piece of data to ask whether anything happened in the last five minutes or the last five hours, you can use a windowed DStream.

A windowed DStream has a window interval, which is the span of previous events you want to look at, and a slide interval, which moves the window forward. When you apply an operation, you apply it to whole windows: instead of operating on individual RDDs, the operation runs over all RDDs that arrived within the specified window. The window can be any multiple of your microbatch interval. You don’t want these windows to be long, because most of this data is either cached in memory or written to local disk. If your window grows beyond, say, 24 hours, and you’re getting a million events per hour, then a lot of data will be stashed in memory or written to disk, and your performance is going to suffer.
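For example, here is a sketch of a windowed aggregation, assuming a hypothetical parsed DStream of SensorReading records like the case class shown earlier: the average temperature per sensor over the last five minutes, recomputed every 30 seconds. Both durations must be multiples of the microbatch interval.

    import org.apache.spark.streaming.Seconds

    val readingsBySensor = parsed.map(r => (r.sensorId, (r.temperatureC, 1)))

    val windowedAverages = readingsBySensor
      .reduceByKeyAndWindow(
        (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2),
        Seconds(300),   // window interval
        Seconds(30))    // slide interval
      .mapValues { case (sum, count) => sum / count }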

There is an API called updateStateByKey that is very useful. Given a key, you can apply an arbitrary operation to the previous value, using the new information that arrived over the last n minutes. So you can combine windowing and updateStateByKey to apply these operations over windows: take your incoming data, put it into a windowed DStream with a specified window, and then apply the state transformations using updateStateByKey.
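A sketch of that combination, again assuming the hypothetical parsed stream: a running maximum temperature per sensor, carried forward from batch to batch. (This requires checkpointing to be enabled, as discussed next.)

    import org.apache.spark.streaming.Seconds

    // State update: combine the previous maximum (if any) with the readings in this batch.
    def updateMaxTemp(newReadings: Seq[Double], state: Option[Double]): Option[Double] = {
      val previousMax = state.getOrElse(Double.MinValue)
      Some((previousMax +: newReadings).max)
    }

    val maxTempPerSensor = parsed
      .map(r => (r.sensorId, r.temperatureC))
      .window(Seconds(300), Seconds(30))
      .updateStateByKey(updateMaxTemp _)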

Checkpoints for Fault Tolerance

One of the most important things about updateStateByKey or windowing is that you always want to enable checkpointing. Checkpoints are used primarily for fault tolerance.
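Enabling checkpointing is a single call on the streaming context, pointing at a reliable filesystem; the HDFS path below is a placeholder.

    // Give Spark a durable directory for checkpoints
    // (required for stateful transformations such as updateStateByKey).
    ssc.checkpoint("hdfs:///checkpoints/sensor-app")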

Think about it: what happens if an RDD goes missing from memory? If you’ve used 60% or 70% of your memory, Spark will drop the RDDs. At that point, you want to reconstruct those RDDs.

Spark’s approach to fault tolerance is to go back to the original data and reapply the series of transformations that produced that RDD in the first place. The problem is that it may have to apply a very large number of operations to the original data. If the original data is from several weeks ago, you could use up your entire stack just applying operations; you could end up with a stack overflow, and your operation would never complete. In that case, you want to truncate that chain of transformations and pick up the most recent available value of the keys instead.

Spark will checkpoint the state of that RDD at a point in time and write it to persistent storage such as HDFS. So when you have an RDD with a long chain of transformations behind it, but also a checkpoint, Spark simply recovers from the checkpoint rather than trying to reapply all of the operations. Checkpointing therefore saves your application from both long-chain reprocessing and stack overflow errors. And because the checkpoint captures the state of the application when it died, you recover from the point where the failure happened.

It’s fairly simple to write an application that restarts from a checkpoint. Instead of creating a new streaming context directly, you supply a function that creates one. Spark first looks at the checkpoint directory: if a checkpoint exists, it rebuilds the context from the checkpoint instead of calling your function to create a fresh one.
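This is what StreamingContext.getOrCreate does; a minimal sketch, with a placeholder checkpoint directory, follows.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/sensor-app"

    // Called only if no checkpoint exists; otherwise the context is rebuilt from the checkpoint.
    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("SensorStream")
      val ssc  = new StreamingContext(conf, Seconds(2))
      ssc.checkpoint(checkpointDir)
      // ... build the DStream pipeline here ...
      ssc
    }

    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()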

Start Your Application from the Checkpoint

Checkpoints are terrific for fault tolerance or restarting your applications, but they’re not good for upgrades. Checkpoints in Spark are Java-serialized, and anyone who has ever used Java serialization knows that once you upgrade your application and change your classes, the old serialized data becomes useless.

The problem is that your checkpoint had all of your data. If you change your application, suddenly all that data has disappeared—not good. Most users want an application that does checkpointing, but they also want to upgrade their applications.

In that case, what do we do? How do you upgrade a checkpoint? The challenge is that your data would need to be separated from your code. Because checkpoints are serialized classes, your data is now tied into your code.

The answer is pretty simple if you think about it. Your application has its own data, and you know what you want to keep track of: usually some RDD that has been generated by your operations, the state you generated from updateStateByKey, and the last Kafka offsets that were reliably processed and written out to HDFS. So if you know what you want to process and save, why not save it separately?

Enable checkpointing in your application so that the truncation of the chain still happens, but don’t restart the upgraded application from the Spark Streaming checkpoint via the get-or-create method. When you start the upgraded application, start fresh: instead of using getOrCreate, read the state and offsets that you wrote out yourself, and then apply your operations from that point on.
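A hedged sketch of that pattern, assuming the maxTempPerSensor state stream and the rawEvents direct stream from the earlier sketches; the HDFS paths are placeholders, and how you persist the offsets (HDFS, ZooKeeper, HBase) is up to you.

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    // On every batch, persist the things you actually need after an upgrade:
    // the state RDD and the Kafka offsets that were reliably processed.
    maxTempPerSensor.foreachRDD { (rdd, time) =>
      rdd.saveAsObjectFile(s"hdfs:///state/max-temp/${time.milliseconds}")
    }

    rawEvents.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // Write topic, partition, and untilOffset somewhere durable.
    }

    // On startup of the upgraded application, skip getOrCreate: load the saved state with
    // sc.objectFile(...), read the saved offsets, and build the direct stream from those offsets.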
