J. M. Patel, Getting Structured Data from the Internet, https://doi.org/10.1007/978-1-4842-6576-5_6

6. Introduction to Common Crawl Datasets

Jay M. Patel¹

(1) Specrom Analytics, Ahmedabad, India

In this chapter, we’ll talk about an open dataset called Common Crawl, which is available on AWS’s Registry of Open Data (https://registry.opendata.aws/).

AWS hosts a large variety of open datasets on its servers, and these are freely available to all users. The datasets are uploaded and maintained by third parties; AWS simply waives the monthly charges and/or server fees to support these organizations.

The Common Crawl Foundation (https://commoncrawl.org/) is a 501(c)(3) nonprofit involved in providing open access web crawl data ...

Getting Structured Data from the Internet: Running Web Crawlers/Scrapers on a Big Data Production Scale by Jay M. Patel