Python数据处理

by Jacqueline Kazil, Katharine Jarmul
July 2017
Intermediate to advanced
398 pages
11h 54m
Chinese
Posts & Telecom Press
Content preview from Python数据处理
Chapter 12. Advanced Web Scraping: Screen Scrapers and Spiders
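In Chapter 11 you began building your web scraping skills: you learned how to decide what to scrape, and how and where to scrape it. In this chapter, we will learn to collect content with more advanced scrapers, such as browser-based scrapers and spiders.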
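We will also learn how to debug common problems with advanced web scraping tools, and cover some of the ethical questions that come up when scraping the web. To begin, we will look at browser-based web scraping: using a browser directly from Python to scrape content from web pages.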
12.1 Browser-Based Parsing
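Sometimes a site uses a lot of JavaScript, or other code that runs after the page loads, to populate the page with content. In these cases, it is almost impossible to analyze the site with a normal web scraper: what you end up with is a blank page. You will run into the same problem if you want to interact with the page (that is, if you need to click a button or enter some search text). In either case, you need to find a way to screen read the page. A screen reader uses a browser to open the page, then reads and interacts with the page after it has loaded in the browser.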
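To see why a normal scraper comes back empty-handed on such a page, here is a minimal sketch that uses only the standard library. The HTML string, the `results` id, and the `results_text` helper are made up for illustration: the server sends an empty container, and the text only appears after a browser runs the script.

```python
from html.parser import HTMLParser

# Hypothetical page: the server sends an empty container plus a script
# that would fill it in once a browser executes the JavaScript.
RAW_HTML = """
<html>
  <body>
    <div id="results"></div>
    <script>
      document.getElementById("results").innerHTML = "<p>42 rows found</p>";
    </script>
  </body>
</html>
"""


class ResultsText(HTMLParser):
    """Collect the text found inside <div id="results">."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside the results div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif tag == "div" and ("id", "results") in attrs:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def results_text(html):
    """Return the text a static parser sees inside the results div."""
    parser = ResultsText()
    parser.feed(html)
    return " ".join(parser.chunks)


# The static parser never runs the script, so the container stays empty.
print(repr(results_text(RAW_HTML)))
```

Feeding the same helper the markup as a browser would leave it after the script runs, for example `<div id="results"><p>42 rows found</p></div>`, returns the text. That difference is the gap a screen reader closes.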
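Screen readers are well suited to tasks that require stepping through a series of actions to get at the information. For the same reason, screen reader scripts are also an easy way to automate routine web tasks.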
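The most commonly used screen reading library in Python is Selenium (http://selenium.googlecode.com/svn/trunk/docs/api/py/index.html). Selenium is a Java program that opens a browser and interacts with web pages by reading them. If you already know Java, you can use the Java IDE to interact with the browser. We will use Python ...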
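The preview breaks off here. As a rough sketch of what driving a browser from Python can look like, the function below opens a page and returns the HTML as the browser holds it after loading. It assumes the `selenium` package plus Firefox and its driver are installed. The helper name `fetch_rendered_html` and the `make_driver` parameter are made up for this illustration and are not taken from the book.

```python
def fetch_rendered_html(url, make_driver=None):
    """Open url in a browser and return the HTML after the page has loaded."""
    if make_driver is None:
        # Imported lazily so the module loads even where Selenium is absent.
        from selenium import webdriver
        # Assumption: Firefox and its driver are available on this machine.
        make_driver = webdriver.Firefox
    driver = make_driver()
    try:
        driver.get(url)              # the browser loads the page and runs its scripts
        return driver.page_source    # HTML as the browser currently holds it
    finally:
        driver.quit()                # always close the browser window
```

Calling `fetch_rendered_html("http://example.com")` would launch a real browser window. `page_source` reflects the page at the moment of the call, so a page that keeps loading content afterward may need an explicit wait before it is read.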

Publisher Resources

ISBN: 9787115459190