Chapter 2

Exploring Big Data with Python

IN THIS CHAPTER

• Using NumPy for data science

• Using pandas for fast data analysis

• Our first data science project

• Visualization with MatPlotLib in Python

In this chapter we get into some of the tools and processes used by data scientists to format, process, and query their data.

There are a number of tools and libraries available for data science (including other languages, such as R), but we decided to use NumPy for three reasons. First, it is one of the two most popular tools to use for data science in Python. Second, many AI-oriented projects use NumPy (such as the one in our last chapter). And third, the highly useful Python data science package, pandas, is built on NumPy.

Pandas is turning out to be a very important package in data science. Because it encapsulates data at a more abstract level, it makes the transformations you apply to the base datasets easier to manipulate, document, and understand.
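To make that relationship concrete, here is a minimal sketch of the same small dataset handled first as a plain NumPy array and then as a pandas DataFrame. The sensor names and readings are made up for illustration and do not come from any project in this book.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: three readings from two made-up sensors.
raw = np.array([[20.5, 30.1],
                [21.0, 29.8],
                [19.5, 31.2]])

# With plain NumPy, you address a column by its position.
numpy_mean = raw[:, 0].mean()

# pandas wraps the same values and adds column labels,
# so the code says which column it is working on.
df = pd.DataFrame(raw, columns=["sensor_a", "sensor_b"])
pandas_mean = df["sensor_a"].mean()

# The labeled data can be handed back to NumPy at any time.
round_trip = df.to_numpy()

print(numpy_mean, pandas_mean)
print(df.describe())
```

Both means come out the same. The difference is that `df["sensor_a"]` documents itself in a way that `raw[:, 0]` does not, which is the kind of abstraction described above.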

Finally, MatPlotLib is a good package for visualizing the results of big data analysis. It’s very Python-centric, but it has a steep learning curve. However, ...
