August 2019
As we mentioned already, this data has plenty of issues: web pages differ widely in structure and offer different sets of information, formatted differently. Cleaning all of it will take another chapter (and indeed, that's what we'll do in Chapter 11, Data Cleaning and Manipulation). It is good practice, however, to perform a modicum of basic quality control, verifying that every page's record is not null and has some minimal, requisite properties. We could also add further checks, ensuring, for example, that the additional fields are not empty for at least a significant share of the pages.
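To make these checks concrete, here is a minimal sketch of what such quality control might look like. It assumes each scraped page is stored as a dictionary; the field names (`name`, `url`, `date`, `location`), the 50% threshold, and the `check_quality` helper itself are hypothetical and chosen only for illustration, not taken from the chapter's actual code.

```python
# Hypothetical field names and threshold, chosen for illustration only.
REQUIRED_FIELDS = ("name", "url")
OPTIONAL_FIELDS = ("date", "location")
MIN_FILL_RATE = 0.5  # share of records in which an optional field must be filled


def is_missing(value):
    """Treat None, blank strings, and empty containers as missing."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    if isinstance(value, (list, tuple, dict, set)):
        return len(value) == 0
    return False


def check_quality(records, required=REQUIRED_FIELDS,
                  optional=OPTIONAL_FIELDS, min_fill_rate=MIN_FILL_RATE):
    """Raise ValueError on the first failed check; return fill rates otherwise."""
    if not records:
        raise ValueError("no records to check")

    # 1. Every record must exist and carry all required properties.
    for i, record in enumerate(records):
        if record is None:
            raise ValueError(f"record {i} is null")
        for field in required:
            if is_missing(record.get(field)):
                raise ValueError(f"record {i} lacks required field '{field}'")

    # 2. Optional fields must be filled in a significant share of records.
    fill_rates = {}
    for field in optional:
        filled = sum(not is_missing(r.get(field)) for r in records)
        fill_rates[field] = filled / len(records)
        if fill_rates[field] < min_fill_rate:
            raise ValueError(
                f"field '{field}' is filled in only {fill_rates[field]:.0%} of records"
            )
    return fill_rates
```

Calling `check_quality(records)` right after scraping would then fail fast on a null record or a missing required property, and return the share of records in which each optional field is filled. Raising an exception rather than logging is a design choice for this sketch; a softer variant could collect all problems and report them together.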
The approach we'll be using is two-fold. First, ...