Gap

To demonstrate using a Sitemap to investigate content, we will use the Gap website.

Gap has a well-structured website with a Sitemap to help web crawlers locate its updated content. Using the techniques from Chapter 1, Introduction to Web Scraping, to investigate the site, we find its robots.txt file at http://www.gap.com/robots.txt, which contains a link to this Sitemap:

Sitemap: http://www.gap.com/products/sitemap_index.xml 
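The Sitemap directive can also be read programmatically. The following is a minimal sketch, assuming Python 3.8 or later (where the standard library's RobotFileParser gained the site_maps() method). The ROBOTS_LINES list and the find_sitemaps helper are illustrative names, and the robots.txt content is limited to the single line quoted above rather than fetched from the live site.

```python
from urllib.robotparser import RobotFileParser

# Only the Sitemap directive quoted in the text; the real robots.txt
# holds more rules, which are not reproduced here.
ROBOTS_LINES = [
    "Sitemap: http://www.gap.com/products/sitemap_index.xml",
]


def find_sitemaps(robots_lines):
    """Return the Sitemap URLs declared in robots.txt lines (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    # site_maps() returns None when no Sitemap directive is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    for url in find_sitemaps(ROBOTS_LINES):
        print(url)
```

To work against the live file instead, one option is to call set_url() and read() on the parser before site_maps(), at the cost of a network request.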

Here are the contents of the linked Sitemap file:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>http://www.gap.com/products/sitemap_1.xml</loc>
        <lastmod>2017-03-24</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://www.gap.com/products/sitemap_2.xml</loc>
        ...
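A Sitemap index like this one can be walked with the standard library's XML parser to collect the child Sitemap URLs. The sketch below is one possible approach, not necessarily the one the book goes on to use. SAMPLE_INDEX is a shortened copy of the listing above, closed off so that it parses, and extract_sitemap_links is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <sitemapindex> root element.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Shortened copy of the index shown above, closed off so it parses.
# The real file lists more entries, which are not reproduced here.
SAMPLE_INDEX = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>http://www.gap.com/products/sitemap_1.xml</loc>
        <lastmod>2017-03-24</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://www.gap.com/products/sitemap_2.xml</loc>
    </sitemap>
</sitemapindex>"""


def extract_sitemap_links(xml_text):
    """Return (loc, lastmod) pairs; lastmod is None when the tag is absent."""
    # Parse from bytes so the encoding declaration is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    links = []
    for entry in root.findall("sm:sitemap", SITEMAP_NS):
        loc = entry.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = entry.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        links.append((loc.strip() if loc else None, lastmod))
    return links


if __name__ == "__main__":
    for loc, lastmod in extract_sitemap_links(SAMPLE_INDEX):
        print(loc, lastmod)
```

The namespace mapping matters here: without it, findall("sitemap") matches nothing, because every element in the file belongs to the sitemaps.org namespace.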
