Creating some big datasets as examples

As a typical example of big data analysis, we will use some textual data from the internet. We will take advantage of the fetch_20newsgroups function available in scikit-learn, which downloads a collection of 11,314 posts, each one averaging about 206 words, that appeared in 20 different newsgroups:

In: import numpy as np
    from sklearn.datasets import fetch_20newsgroups
    newsgroups_dataset = fetch_20newsgroups(shuffle=True,
                         remove=('headers', 'footers', 'quotes'),
                         random_state=6)
    print('Posts inside the data: %s' % np.shape(newsgroups_dataset.data))
    print('Average number of words for post: %0.0f' %
          np.mean([len(text.split(' ')) for text in
          newsgroups_dataset.data]))

Out: Posts inside the data: 11314
     Average number of words for ...
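The word-count statistic above is just a mean over per-post token counts. As a minimal sketch that runs without downloading the dataset, the same logic can be checked on a toy list of strings (the posts below are made up for illustration):

```python
import numpy as np

# A toy stand-in for newsgroups_dataset.data: a plain list of raw text strings
posts = ["the quick brown fox",
         "jumps over",
         "the lazy dog right now"]

# np.shape on a list of strings reports its length, just as in the snippet above
n_posts = np.shape(posts)[0]

# Split each post on single spaces and average the resulting token counts
avg_words = np.mean([len(text.split(' ')) for text in posts])

print('Posts inside the data: %s' % n_posts)
print('Average number of words for post: %0.0f' % avg_words)
```

Note that splitting on a single space is a rough tokenization: it ignores punctuation and multiple spaces, which is acceptable here because we only need an order-of-magnitude estimate of post length.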
