5 IoT’s Data Processing Using Spark
Ankita Bansal¹* and Aditya Atri²
¹Netaji Subhas University of Technology, Delhi, India
²Netaji Subhas Institute of Technology, Delhi, India
Abstract
Big Data refers to large volumes of structured and unstructured data that cannot be processed using traditional database methods and therefore requires efficient frameworks and software techniques. One well-known system for Big Data processing is Spark. Hadoop’s MapReduce technology was used for batch processing in cluster computing; Spark was introduced to make such processing faster. Spark has its own processing engine, which can use Hadoop’s distributed file storage as well as cloud storage. Spark’s APIs conform to the type of data and the processing it requires, and Spark provides functionalities and tools for query processing, graph processing, and machine learning algorithms. Spark SQL, the query-processing component of the Spark framework, plays an important role and supports the storage of large datasets on the cloud. Spark also performs operations on input data taken from many different data sources, and it uses built-in functions to create and maintain DataFrames.
Keywords: RDD, DataFrames, datasets, Spark SQL, SQLContext, Hive tables, JSON, Parquet files, data sources, Hadoop, MapReduce, cloud, Big Data, Spark, cluster computing, Spark API
5.1 Introduction
In this chapter, we will start with the basics of three ...