
Spark for Python Developers by Amit Nandi


Preprocessing the data for visualization

Before jumping into the visualizations, we will do some preparatory work on the data harvested:

In [16]:
# Read the harvested data stored in CSV into a pandas DataFrame
import pandas as pd
csv_in = '/home/an/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark/data/unq_tweetstxt.csv'
pddf_in = pd.read_csv(csv_in, index_col=None, header=0, sep=';', encoding='utf-8')

In [20]:
print('tweets pandas dataframe - count:', pddf_in.count())
print('tweets pandas dataframe - shape:', pddf_in.shape)
print('tweets pandas dataframe - colns:', pddf_in.columns)

('tweets pandas dataframe - count:', Unnamed: 0    7540
id            7540
created_at    7540
user_id       7540
user_name     7538
tweet_text    7540
dtype: int64)
('tweets pandas dataframe - shape:', (7540, ...
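The output above shows 7538 for user_name against 7540 for the other columns. This is because count() reports only non-null values per column, whereas shape reports the total number of rows. The following is a minimal sketch of that difference, assuming a small in-memory sample in place of the harvested file. The sample rows, the variable names pddf_sample, non_null_counts and missing_per_column, and the use of io.StringIO are illustrative assumptions and not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the harvested tweets file.
# The header starts with ';' so the first column has no name,
# which pandas labels 'Unnamed: 0' (as in the output above).
sample_csv = (
    ";id;created_at;user_id;user_name;tweet_text\n"
    "0;101;2015-10-01 10:00:00;11;alice;Learning Spark\n"
    "1;102;2015-10-01 10:05:00;12;;Python and pandas\n"
    "2;103;2015-10-01 10:10:00;13;carol;Data visualization\n"
)

# Same parsing options as the original cell, minus the file encoding,
# which does not apply to an in-memory text buffer.
pddf_sample = pd.read_csv(io.StringIO(sample_csv),
                          index_col=None, header=0, sep=';')

# count() gives the number of non-null values in each column.
non_null_counts = pddf_sample.count()

# shape gives (rows, columns) regardless of missing values.
n_rows, n_cols = pddf_sample.shape

# isnull().sum() gives the number of missing values in each column.
missing_per_column = pddf_sample.isnull().sum()

print('count:', non_null_counts.to_dict())
print('shape:', (n_rows, n_cols))
print('columns:', list(pddf_sample.columns))
print('missing:', missing_per_column.to_dict())
```

In this sample the second row has an empty user_name, so count() for that column is one lower than the row count. The same effect would explain the two-row gap in the harvested data.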
