As a typical example of big data analysis, we will use some textual data from the internet. Scikit-learn provides the fetch_20newsgroups function, which loads the 20 Newsgroups dataset; its default training subset contains 11,314 posts, each averaging about 206 words, drawn from 20 different newsgroups:
In: import numpy as np
from sklearn.datasets import fetch_20newsgroups
newsgroups_dataset = fetch_20newsgroups(shuffle=True,
    remove=('headers', 'footers', 'quotes'), random_state=6)
print('Posts inside the data: %s' % np.shape(newsgroups_dataset.data))
print('Average number of words for post: %0.0f' %
    np.mean([len(text.split(' ')) for text in newsgroups_dataset.data]))

Out: Posts inside the data: 11314
Average number of words for ...
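The word count above depends on how each post is tokenized. The following is a minimal sketch, not part of the original example, of how text.split(' ') and text.split() can differ. The sample_posts list and the mean_words helper are hypothetical stand-ins for newsgroups_dataset.data and the np.mean expression, chosen so that the snippet runs without downloading the dataset:

```python
# Hypothetical mini-corpus standing in for newsgroups_dataset.data
sample_posts = [
    "Hello  world",     # two consecutive spaces
    "one\ntwo three",   # a newline between the first two words
    "",                 # an empty post
]

def mean_words(posts, sep=None):
    """Average number of tokens per post.

    sep=' ' splits on every single space (as in the example above);
    sep=None splits on any run of whitespace, including newlines.
    """
    counts = [len(text.split(sep)) for text in posts]
    return sum(counts) / len(counts)

avg_space = mean_words(sample_posts, sep=' ')
avg_whitespace = mean_words(sample_posts)

print('Average with split(" "): %0.2f' % avg_space)
print('Average with split():    %0.2f' % avg_whitespace)
```

Splitting on a single space counts the empty strings produced by repeated spaces and does not separate words joined by a newline, so the reported average is best read as a rough estimate of post length rather than an exact word count.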