• Sometimes you have to address bad data produced when the sensors in a few devices malfunction.
• You may have to cleanse the data and filter out sensitive values that cannot be stored in certain governance zones.
• This chapter helps you understand how StreamSets Data Collector simplifies data movement.
• Regardless of which type of data the sensors and devices generate, the backend architecture is usually the same. Such an architecture typically has the following characteristics:
• A publish/subscribe mechanism or MQTT message broker such as Kafka, RabbitMQ, or Solace Systems handles the incoming data (see the consumer sketch after this list).
• A StreamSets Data Collector pipeline is used for routing, cleaning, and enriching data.
• A Hadoop cluster supports the analysis and processing of the data.
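As a hedged illustration of the ingest side of this architecture, the sketch below consumes sensor readings from a pub/sub broker using the open-source kafka-python client; the topic name, broker address, and record fields are placeholders, not details from the text:

# Minimal consumer sketch for the pub/sub ingest layer described above.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "sensor-readings",                   # hypothetical topic name
    bootstrap_servers="broker:9092",     # placeholder broker address
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    reading = message.value              # one sensor record, e.g. {"device_id": ..., "temperature_c": ...}
    # In the architecture above, a Data Collector pipeline would pick the
    # record up at this point for routing, cleaning, and enrichment.
    print(reading)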
StreamSets Data Collector offers a drag-and-drop user interface for designing, testing, and operating data flow pipelines. The system is built for continuous processing and accepts data from several streaming sources, such as Apache Kafka, RabbitMQ, and Cloudera's Kafka distribution. Built-in transformation processors let you implement sanitization methods for merging, masking, hashing, splitting, lookups, and parsing, and the list of processors is continuously growing. If you want to apply your own custom logic, you can use the Jython, JavaScript, or Groovy processors; lastly, an API lets you build Java-based processor stages.
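As a hedged example of such custom logic, here is a minimal sketch of a Jython processor script. The records, output, and error bindings are the scripting objects Data Collector exposes to Jython scripts (check your version's documentation for the exact names); the field names temperature_f and temperature_c are purely illustrative:

# Enrich each record by converting a Fahrenheit reading to Celsius.
for record in records:
    try:
        f = record.value['temperature_f']
        record.value['temperature_c'] = (f - 32.0) * 5.0 / 9.0
        output.write(record)                # pass the enriched record downstream
    except Exception as e:
        # Route the failed record to the stage's error stream for review.
        error.write(record, str(e))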
The StreamSets Data Collector pipeline executes each transformation in memory and delivers records in order under at-least-once or at-most-once delivery semantics. The DevOps-friendly IDE helps you build, test, and run pipelines, so you can turn your streaming Internet of Things data into a dataset that is ready for consumption, whether for visualization or analysis.
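Because at-least-once delivery can hand the same record to a consumer more than once, downstream writes are often made idempotent. The following is a minimal sketch of that idea, assuming each record carries a unique event_id field (an assumption about the schema, not something from the text):

# Deduplicate on a per-record ID so redeliveries are written only once.
seen_ids = set()

def write_once(record, sink):
    event_id = record["event_id"]        # hypothetical unique key
    if event_id in seen_ids:
        return                           # duplicate from a redelivery; skip it
    seen_ids.add(event_id)
    sink.append(record)                  # stand-in for the real data store write

sink = []
write_once({"event_id": "a1", "temperature_c": 21.5}, sink)
write_once({"event_id": "a1", "temperature_c": 21.5}, sink)  # redelivered duplicate, ignored
assert len(sink) == 1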
While a pipeline runs, you get high runtime visibility into your data flows, including error rates, processing time, and throughput for every pipeline stage. You can also create threshold-based rules and alerts to handle scenarios where the processing rate slows down or anomalous data values appear.
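As a rough illustration of such a threshold rule, the sketch below flags a pipeline whose error rate crosses a limit; the 2 percent threshold and the metric values are invented for the example:

# Fire an alert when errors/processed exceeds a configured threshold.
def check_error_rate(errors, processed, threshold=0.02):
    """Return an alert string if the error rate exceeds the threshold, else None."""
    if processed == 0:
        return None
    rate = errors / processed
    if rate > threshold:
        return "ALERT: error rate %.1f%% exceeds %.1f%%" % (rate * 100, threshold * 100)
    return None

print(check_error_rate(errors=150, processed=5000))   # 3.0% -> alert fires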
Problems with IoT Streaming Ingestion
Among the common problems in a large-scale Internet of Things deployment are data quality issues caused by aging devices and the dilemma of multiple device versions distributed across the installed base. You also have to consider the need to enrich the data before it is sent to the data store. Lastly, you have to plan how the system will handle many concurrent sensor streams.
Managing Bad Data
Because an IoT deployment involves thousands of devices, at some point you will find a device that is poorly calibrated or has become defective. Records from such devices have to be handled before they reach the data store.
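A simple way to catch a miscalibrated or defective sensor is a range check applied before the record reaches the data store. Here is a minimal sketch; the temperature bounds and field names are assumptions for illustration:

# Separate plausible readings from out-of-range ones before storage.
def validate_reading(record, low=-40.0, high=85.0):
    """Return (good_record, None) for a valid reading, or (None, error_record)."""
    temp = record.get("temperature_c")
    if temp is None or not (low <= temp <= high):
        record["error"] = "temperature out of range: %r" % temp
        return None, record
    return record, None

good, bad = validate_reading({"device_id": "d42", "temperature_c": 412.0})
print(bad)   # routed to error handling rather than the primary store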
StreamSets Data Collector can detect the issue and present insights into the errors along with the exact error record; it can also display a stack trace for the failure condition. All of this happens without affecting the primary pipeline. Error records can be stored on disk or routed to a secondary pipeline connected to Elasticsearch or Kafka for remediation.
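As a hedged sketch of that secondary pipeline, the snippet below publishes rejected records to a separate Kafka topic for later remediation, using the open-source kafka-python client; the topic and broker names are placeholders:

# Quarantine bad records on a dedicated topic for later review.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker:9092",     # placeholder broker address
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

def quarantine(error_record):
    # Send the bad record, with its error annotation, to the remediation topic.
    producer.send("sensor-errors", error_record)

quarantine({"device_id": "d42", "error": "temperature out of range: 412.0"})
producer.flush()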