Chapter 56. Spidering an FTP Site

Gerard Lanois

This article is the result of my own personal adventures in maintaining a rapidly growing web site via FTP, without the benefit of a telnet shell on my server. If you have FTP access to your web server’s file tree, there are four reasons why mirroring with FTP instead of HTTP might be a better choice:

  1. Your ISP’s web server munges links and image paths in your HTML pages, so you can’t use HTTP to mirror the site.

  2. There is a cache between your HTTP client and your web server, making you retrieve out-of-date pages.

  3. Your web site contains dynamically generated content.

  4. You have data besides HTML pages and images, such as Perl programs.

This article demonstrates how to recursively traverse an FTP site using the Net::FTP module, which is bundled with Perl and also available on CPAN. For the pedantically inclined, further background on FTP is available in RFC 959 (http://www.yahoo.com/Computers_and_Internet/Standards/RFCs/).
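As a rough illustration of what recursive traversal with Net::FTP can look like, here is a minimal sketch. It is not the program developed in this chapter. The spider subroutine, its callback interface, and the listing regex are hypothetical names and choices made for this example, and the code assumes the server returns Unix "ls -l" style listings, which not every FTP server does.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Net::FTP;

# Hypothetical helper: walk $dir on the server and call
# $visit->($path, $is_dir) for every file and directory found.
# Assumption: $ftp->dir returns Unix "ls -l" style lines.
sub spider {
    my ($ftp, $dir, $visit) = @_;

    # In list context, Net::FTP's dir method returns the listing lines.
    my @lines = $ftp->dir($dir);

    for my $line (@lines) {
        # Some servers send a "total N" header line; skip it.
        next if $line =~ /^total\s+\d+/;

        # Fields: perms, links, owner, group, size, month, day,
        # time-or-year, name. Lines that do not fit are skipped.
        my ($perms, $name) = $line =~
            /^(\S+)\s+\d+\s+\S+\s+\S+\s+\d+\s+\S+\s+\d+\s+\S+\s+(.+)$/
            or next;

        next if $name eq '.' or $name eq '..';

        my $type = substr($perms, 0, 1);
        next if $type eq 'l';    # do not follow symbolic links

        my $path   = $dir eq '/' ? "/$name" : "$dir/$name";
        my $is_dir = $type eq 'd';

        $visit->($path, $is_dir);
        spider($ftp, $path, $visit) if $is_dir;    # descend into subdirectory
    }
}

# Connect only when a host name is given on the command line:
#   perl spider.pl ftp.example.com [user] [password]
if (@ARGV) {
    my ($host, $user, $pass) = @ARGV;

    my $ftp = Net::FTP->new($host)
        or die "Cannot connect to $host: $@";
    $ftp->login($user // 'anonymous', $pass // 'anonymous@')
        or die "Login failed: ", $ftp->message;

    # Print every path, marking directories with a trailing slash.
    spider($ftp, '/', sub {
        my ($path, $is_dir) = @_;
        print $path, ($is_dir ? '/' : ''), "\n";
    });

    $ftp->quit;
}
```

The visitor callback keeps the traversal separate from whatever is done with each entry, so the same walk could print paths, compare timestamps, or fetch files with $ftp->get. Parsing the long listing is the fragile part of this sketch; a server with a different listing format would need a different pattern.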

Motivation

You may find yourself in the unenviable position of trying to maintain a remote file tree without shell access to the system where your file tree resides. Your file tree might contain a web site, an FTP site, or other data.

Many ISPs do not provide shell accounts, either for security reasons or because the host operating system (such as Windows or old versions of Mac OS) has no concept of a remote login shell. If you take the login shell out of the equation and wish to automate the process of moving ...
