May 2017
Beginner to intermediate
220 pages
5h 2m
English
To scrape the annotated fields, Portia uses a library called Scrapely (https://github.com/scrapy/scrapely), a useful open-source tool developed independently of Portia. Scrapely uses training data (an example page together with the values to extract from it) to build a model of what to scrape from a web page. The trained model can then be applied to scrape other web pages with the same structure.
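To make the train-then-apply idea concrete, here is a minimal toy sketch that uses only the standard library. It is not Scrapely's actual algorithm: it simply remembers the text surrounding each example value and looks for the same surroundings on another page. The function names train and scrape, the context parameter, and the two inline HTML snippets are all assumptions made for this illustration and do not reflect the real site's markup.

```python
import re

def train(html, examples, context=20):
    """Toy 'training': for each field, remember the text just before
    and just after its example value on the training page."""
    template = {}
    for field, value in examples.items():
        start = html.find(value)
        if start == -1:
            raise ValueError(f"example value for {field!r} not found")
        end = start + len(value)
        prefix = html[max(0, start - context):start]
        suffix = html[end:end + context]
        template[field] = (prefix, suffix)
    return template

def scrape(html, template):
    """Apply the learned surroundings to a page with the same structure."""
    result = {}
    for field, (prefix, suffix) in template.items():
        pattern = re.escape(prefix) + r"(.*?)" + re.escape(suffix)
        match = re.search(pattern, html, re.DOTALL)
        result[field] = match.group(1) if match else None
    return result

# Hypothetical pages sharing one structure (invented for this sketch).
TRAIN_PAGE = ('<table><tr id="name"><td>Afghanistan</td></tr>'
              '<tr id="population"><td>29,121,286</td></tr></table>')
TEST_PAGE = ('<table><tr id="name"><td>United Kingdom</td></tr>'
             '<tr id="population"><td>62,348,447</td></tr></table>')

if __name__ == "__main__":
    template = train(TRAIN_PAGE, {'name': 'Afghanistan',
                                  'population': '29,121,286'})
    print(scrape(TEST_PAGE, template))
```

A literal-context matcher like this breaks as soon as the markup around a value changes even slightly, which is why a dedicated library is preferable for real pages.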
You can install Scrapely using pip:
pip install scrapely
Here is an example to show how it works:
>>> from scrapely import Scraper
>>> s = Scraper()
>>> train_url = 'http://example.webscraping.com/view/Afghanistan-1'
>>> s.train(train_url, {'name': 'Afghanistan', 'population': '29,121,286'})
>>> test_url = 'http://example.webscraping.com/view/United-Kingdom-239'
>>> s.scrape(test_url)
...