In this chapter, we’ll talk about an open dataset called Common Crawl, which is available on AWS’s Registry of Open Data (https://registry.opendata.aws/).
AWS hosts a wide variety of open datasets on its servers and makes them freely available to all users. These datasets are uploaded and maintained by third parties; AWS simply waives the monthly charges and/or server fees to support these organizations.
The Common Crawl Foundation (https://commoncrawl.org/) is a 501(c)(3) nonprofit involved in providing open access web crawl data ...