Big Data Simplified

Introducing Spark andKafka | 137

val emp_country = hiveContext.sql("select distinct(ctry) from empSpark").collect.foreach(println)

//to display five records from the table 'author_hive' from 'sqoopdb'

val emp_country = hiveContext.sql("select * from sqoopdb.author_hive limit 5").collect.foreach(println)

//to display the number of total records from the table 'author_hive' from 'sqoopdb'

val emp_country = hiveContext.sql("select count(*) from sqoopdb.author_hive").collect.foreach(println)
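The queries above bind the result of foreach(println), which is Unit, so nothing can be reused afterwards. As a minimal sketch, the same record count could be kept as a value by using SparkSession with enableHiveSupport(), which has superseded HiveContext since Spark 2.0. The object name HiveQuerySketch and the helper countQuery are hypothetical and introduced only for illustration. The table sqoopdb.author_hive is assumed to exist in a reachable Hive metastore.

```scala
import org.apache.spark.sql.SparkSession

object HiveQuerySketch {
  // Hypothetical helper: builds the count query for a given database and table.
  def countQuery(db: String, table: String): String =
    s"select count(*) from $db.$table"

  def main(args: Array[String]): Unit = {
    // Assumes Spark was built with Hive support and a metastore is configured.
    val spark = SparkSession.builder()
      .appName("HiveQuerySketch")
      .enableHiveSupport()
      .getOrCreate()

    // Keep the five records as a DataFrame instead of printing them immediately.
    val authors = spark.sql("select * from sqoopdb.author_hive limit 5")
    authors.show()

    // count(*) yields a single row with one bigint column, read here as a Long.
    val total: Long = spark.sql(countQuery("sqoopdb", "author_hive")).first().getLong(0)
    println(s"Total records: $total")

    spark.stop()
  }
}
```

Holding the DataFrame and the count in named values lets later steps reuse them, which the println-only form does not allow.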

6.1.5 Spark Libraries: Streaming

The Spark Streaming library processes streaming data. It is a very popular library because it takes Spark’s big data processing power and extends it to ‘fast data’. Spark Streaming has proven its ability to stream gigabytes per second (Ref. Figure 6.10). The combination of ‘Big Data’ and ‘fast data’ has huge potential, ranging from real-time fraud detection to marketing that is relevant to a customer now, instead of focusing on the customer’s intention from last week.
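To make the fraud detection use case concrete, here is a minimal sketch of a Spark Streaming job that flags large transactions as they arrive. It is an illustration under stated assumptions, not the book's own example. The input format ("accountId,amount" text lines), the socket source on localhost port 9999, the threshold of 10000.0 and the name FraudAlertSketch are all hypothetical.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.util.Try

object FraudAlertSketch {
  // Hypothetical rule: any transaction above this amount is flagged.
  val Threshold = 10000.0

  // Parses "accountId,amount"; malformed lines yield None and are dropped.
  def parse(line: String): Option[(String, Double)] =
    line.split(",").map(_.trim) match {
      case Array(id, amount) => Try(amount.toDouble).toOption.map(a => (id, a))
      case _                 => None
    }

  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the socket receiver, one for processing.
    val conf = new SparkConf().setMaster("local[2]").setAppName("FraudAlertSketch")
    // Incoming data is grouped into 5-second micro-batches.
    val ssc = new StreamingContext(conf, Seconds(5))

    // Assumed source: a text stream on localhost:9999.
    val lines = ssc.socketTextStream("localhost", 9999)

    val flagged = lines
      .flatMap(line => parse(line).toList)
      .filter { case (_, amount) => amount > Threshold }

    // Prints the first few flagged transactions of every batch.
    flagged.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

For a local trial, a tool such as `nc -lk 9999` can feed lines like `acc1,12000.5` into the socket. Each batch is processed within seconds of arrival, which is what makes the result relevant to the customer now rather than a week later.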


Big Data Simplified by Sayan Goswami, Amit Kumar Das, Sourabh Mukherjee