
“4137X˙CH07˙Akerkar” — 2007/9/8 — 11:21 — page 278 — #16
278 CHAPTER 7 Web Content Mining
site. For example, if a robot visits a site called http://www.myweb.ca/, it should first
check for http://www.myweb.ca/robots.txt. If this document exists, the robot should
parse it looking for records such as
User-agent: *
Disallow: /
This record indicates that no robot is allowed to retrieve any document from the website: the * matches every robot, and Disallow: / excludes every path.
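The check described above can be sketched in Python with the standard-library module urllib.robotparser. This is a minimal illustration under stated assumptions, not a complete crawler: the helper robots_url and the robot name "ExampleBot" are hypothetical names introduced here, and the record is parsed from an in-memory list rather than fetched over the network.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_url(page_url: str) -> str:
    """Return the /robots.txt URL at the root of the site hosting page_url.

    Hypothetical helper for illustration: it keeps the scheme and host
    and discards the path, query, and fragment.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# The record from the text, supplied as lines so no network access is needed.
record = ["User-agent: *", "Disallow: /"]

parser = RobotFileParser(robots_url("http://www.myweb.ca/"))
parser.parse(record)

# "ExampleBot" is an assumed robot name; the * record applies to it.
allowed = parser.can_fetch("ExampleBot", "http://www.myweb.ca/index.html")

print(robots_url("http://www.myweb.ca/docs/page.html"))
print(allowed)
```

A real robot would call parser.read() to download the file from the site instead of parsing a fixed list, and would consult can_fetch before every request.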
A site can have only one “/robots.txt” file, and that file cannot be placed in any of
the user directories: a robot looks for robots.txt only
at the root of a web document hierarchy, such as http://www.m