Hoovering a website
It is often useful to scan a website and extract information from specific tags. This basic mechanism can be used to trawl the web in search of useful bits of information. At other times, you need to get a list of <IMG> tags and their SRC attributes, or <A> tags and their corresponding HREF attributes. The possibilities are endless.
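To make the idea concrete, here is a minimal sketch of pulling attribute values out of specific tags. It assumes PHP's bundled DOM extension is available. The helper name extractAttributes() and the sample markup are hypothetical, invented for illustration only, and the HTML is supplied as an inline string, so nothing is fetched from a real site.

```php
<?php
// Hypothetical helper (not from the original text): returns the values of
// one attribute for every occurrence of a given tag in an HTML string.
function extractAttributes(string $html, string $tag, string $attr): array
{
    $doc = new DOMDocument();

    // Real-world markup is often malformed; buffer libxml warnings
    // instead of letting them be emitted, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $values = [];
    foreach ($doc->getElementsByTagName($tag) as $node) {
        // Skip tags that do not carry the requested attribute.
        if ($node->hasAttribute($attr)) {
            $values[] = $node->getAttribute($attr);
        }
    }
    return $values;
}

// Made-up sample markup standing in for a downloaded page.
$html = '<html><body>'
      . '<a href="/about.html">About</a>'
      . '<img src="/images/logo.png" alt="Logo">'
      . '<a href="https://example.com/">Example</a>'
      . '</body></html>';

print_r(extractAttributes($html, 'img', 'src'));
print_r(extractAttributes($html, 'a', 'href'));
```

The same function covers both cases mentioned above, since only the tag and attribute names change. Tag names are passed in lowercase because the HTML parser normalizes element names.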
How to do it...
- First of all, we need to grab the contents of the target website. At first glance, it seems that we should make a cURL request, or simply use file_get_contents(). The problem with these approaches is that we end up doing a massive amount of string manipulation, most likely making inordinate use of the dreaded regular expression. In order to avoid all of ...