Data analysis

Download OnlineRetail.csv from the link provided with the book. Then, you can load the file using Pandas.

The following is a simple way of reading a local file using Pandas:

import pandas as pd
path = '/Users/sridharalla/Documents/OnlineRetail.csv'
df = pd.read_csv(path)

However, since we are analyzing data in a Hadoop cluster, we should read the file from HDFS rather than from the local filesystem. The following is an example of how a file stored in HDFS can be loaded into a pandas DataFrame:

import pandas as pd
from hdfs import InsecureClient
client_hdfs = InsecureClient('http://localhost:9870')
with client_hdfs.read('/user/normal/OnlineRetail.csv', encoding='utf-8') as reader:
    df = pd.read_csv(reader, index_col=0)
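The HDFS example passes index_col=0 to read_csv, while the local example does not. The following minimal sketch shows the difference that argument makes. It reads from an in-memory string so that it runs without a cluster; the sample rows and column names are assumptions made for illustration and are not taken from the book's file.

```python
import io
import pandas as pd

# Hypothetical sample rows; the column names are assumed for illustration only.
sample_csv = (
    "InvoiceNo,StockCode,Quantity,UnitPrice,Country\n"
    "536365,85123A,6,2.55,United Kingdom\n"
    "536365,71053,6,3.39,United Kingdom\n"
    "536366,22633,6,1.85,United Kingdom\n"
)

# Default: pandas generates an integer index and keeps all five fields as columns.
df_default = pd.read_csv(io.StringIO(sample_csv))

# index_col=0: the first field becomes the row index instead of a column.
df_indexed = pd.read_csv(io.StringIO(sample_csv), index_col=0)

print(df_default.shape)        # (3, 5)
print(df_indexed.shape)        # (3, 4)
print(df_indexed.index.name)   # InvoiceNo
```

If the first column of the file is not a meaningful row label, omitting index_col keeps it available as an ordinary column.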

The following is what the following line of code ...
