Python数据处理
by Jacqueline Kazil, Katharine Jarmul
July 2017
Intermediate to advanced
398 pages
11h 54m
Chinese
Posts & Telecom Press
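We successfully ran the search using the Qt browser. Some features are not yet as smooth as they are in Selenium, but Ghost.py is still a fairly young project.

You can gauge a project's age from its version number. At the time of writing, Ghost.py was still below version 1.0 (in fact, this book may only be compatible with the 0.2 releases). It will likely change a great deal over the next few years, but it is a very interesting project. We encourage you to help it along by submitting ideas to its authors and by investigating and fixing bugs.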
Now that we have looked at several different ways to interact with a browser in Python, let's do some crawling!
12.2 Crawling the Web
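If you need to scrape data from multiple pages of a website, a crawler may be the best solution. Web crawlers (or robots) are well suited to traversing an entire domain or site (or a series of domains or sites) in search of information.

You can think of a crawler as an advanced scraper: it lets you use the power of a page-reading scraper (like the ones you learned about in Chapter 11) and apply rules that match URL patterns across an entire site.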
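To make the idea of rules that match URL patterns concrete, here is a minimal sketch using only the standard library. The domain `example.com`, the two path patterns, and the function name `should_follow` are assumptions chosen for illustration; they do not come from the book.

```python
import re
from urllib.parse import urlparse

# Hypothetical rules for illustration: stay on one domain (or its
# subdomains) and follow only URLs whose path matches one of these patterns.
ALLOWED_DOMAIN = "example.com"
FOLLOW_PATTERNS = [
    re.compile(r"^/articles/\d+/?$"),
    re.compile(r"^/category/[\w-]+/?$"),
]


def should_follow(url, allowed_domain=ALLOWED_DOMAIN, patterns=FOLLOW_PATTERNS):
    """Return True if the URL is on the allowed domain and its path matches a rule."""
    parsed = urlparse(url)
    # Skip non-web links such as mailto: or javascript:.
    if parsed.scheme not in ("http", "https"):
        return False
    host = parsed.hostname or ""
    on_domain = host == allowed_domain or host.endswith("." + allowed_domain)
    return on_domain and any(p.search(parsed.path) for p in patterns)
```

A crawler would call a check like this on every link it finds and queue only the URLs that pass, which keeps it from wandering off the site or into pages that hold nothing of interest.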
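A crawler can help you understand how a site is structured. For example, a site may contain an entire subsection you did not know about that holds interesting data. By crawling the domain, you can find subdomains or other related content that is useful for your reporting.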
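When you build a crawler, first study the site you are interested in, then write the page-reading code that identifies and reads the content. Once that is built, you can create a list of rules to follow, which the crawler uses to find other interesting pages and content, while the parser uses the page-reading scraper you wrote to collect and save the content.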
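The division of labor described above (follow rules that find pages, plus a page-reading parser that collects content) can be sketched roughly as follows. This is not the book's implementation: the names `LinkAndTitleParser`, `crawl`, and `FAKE_SITE` are hypothetical, and the "website" is an in-memory dictionary so the sketch runs without network access. A real crawler would replace `fetch` with an HTTP request and should respect the site's robots.txt and rate limits.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkAndTitleParser(HTMLParser):
    """Page-reading scraper: collects link targets and the <title> text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(start_url, fetch, follow_rules, max_pages=50):
    """Breadth-first crawl.

    fetch(url) returns the page HTML or None; follow_rules is a list of
    compiled regexes that a URL must match to be queued.
    """
    seen = {start_url}
    queue = deque([start_url])
    results = {}
    while queue and len(results) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkAndTitleParser()
        parser.feed(html)
        results[url] = parser.title.strip()  # the "collect and save" step
        for href in parser.links:
            # Resolve relative links and drop #fragments before matching.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen and any(r.search(absolute) for r in follow_rules):
                seen.add(absolute)
                queue.append(absolute)
    return results


# In-memory stand-in for a website (illustrative data only).
FAKE_SITE = {
    "http://example.com/": '<title>Home</title>'
                           '<a href="/reports/2015">2015</a><a href="/login">Login</a>',
    "http://example.com/reports/2015": '<title>Report 2015</title>'
                                       '<a href="/reports/2016#top">next</a><a href="/">home</a>',
    "http://example.com/reports/2016": '<title>Report 2016</title>'
                                       '<a href="http://other.org/reports/1">external</a>',
}

rules = [re.compile(r"^http://example\.com/reports/\d+$")]
results = crawl("http://example.com/", FAKE_SITE.get, rules)
```

Here the crawler visits the start page and the two report pages, and it never queues `/login` or the external link because neither matches a follow rule. Passing `fetch` in as a parameter keeps the rule-following logic separate from the network code, so the rules can be tested before the crawler touches a live site.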
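When using a crawler, you should either know in advance exactly what content you want, or first explore the site with a broad approach and then rewrite the crawler to be more specific. If you choose the cast-a-wide-net approach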
Publisher Resources

ISBN: 9787115459190