Preprocessing the data for visualization

Before jumping into the visualizations, we will do some preparatory work on the harvested data:

In [16]:
# Read harvested data stored in csv into a pandas DataFrame
import pandas as pd
csv_in = '/home/an/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/data/unq_tweetstxt.csv'
pddf_in = pd.read_csv(csv_in, index_col=None, header=0, sep=';', encoding='utf-8')

In [20]:
print('tweets pandas dataframe - count:', pddf_in.count())
print('tweets pandas dataframe - shape:', pddf_in.shape)
print('tweets pandas dataframe - colns:', pddf_in.columns)
('tweets pandas dataframe - count:', Unnamed: 0    7540
id            7540
created_at    7540
user_id       7540
user_name     7538
tweet_text    7540
dtype: int64)
('tweets pandas dataframe - shape:', (7540, ...
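In the count output above, user_name reports 7538 non-null values against 7540 for every other column. Since count() skips missing values, two rows have no user name. The following is a minimal sketch of how such gaps could be inspected before plotting. It does not use the harvested file: the rows in sample_csv are invented for illustration, and the names sample_csv, pddf_sample, missing_per_column, and pddf_clean are hypothetical. Dropping the 'Unnamed: 0' column assumes it is only a leftover index from an earlier export, which the excerpt does not confirm.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the harvested tweets file.
# Column names mirror the output above; the values are invented.
sample_csv = (
    ";id;created_at;user_id;user_name;tweet_text\n"
    "0;101;2015-10-01 10:00:00;11;alice;Learning Spark\n"
    "1;102;2015-10-01 10:05:00;12;;Python and Spark\n"
    "2;103;2015-10-01 10:10:00;13;carol;Data visualization\n"
)

# Same read_csv options as in the text, applied to an in-memory buffer.
pddf_sample = pd.read_csv(io.StringIO(sample_csv), index_col=None, header=0, sep=';')

# count() reports non-null values per column; isnull().sum() reports the gaps.
missing_per_column = pddf_sample.isnull().sum()

# Assumption: the unnamed first column is a leftover index and can be dropped.
pddf_clean = pddf_sample.drop('Unnamed: 0', axis=1)

print('sample dataframe - count:')
print(pddf_sample.count())
print('sample dataframe - missing values per column:')
print(missing_per_column)
print('sample dataframe - columns after cleanup:', list(pddf_clean.columns))
```

The empty header field in the first position is what produces the 'Unnamed: 0' column name, and the empty user_name field in the second row is read as a missing value, which is why count() for that column comes out lower than the row count.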
