Python数据处理

by Jacqueline Kazil, Katharine Jarmul
July 2017
Intermediate to advanced
398 pages
11h 54m
Chinese
Posts & Telecom Press
Content preview from Python数据处理
...). Don't lose hope entirely; your script may well keep working for a long time!
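Either way, we don't want to give you false hope. Your script will eventually break. One day you will go to run it and find that it no longer works. When that happens, give yourself a big hug, make yourself a cup of tea or coffee, and start again.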
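You now know more about how to examine the content on a website and pick out the parts most useful for your reporting. You already have a fair amount of code, and most of it still works. You are in a good position to debug, with plenty of tools at your disposal for finding the new div or table that contains the data you need.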
12.5 A Few Words of Caution
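When scraping the web, it is important to be careful. You also need to know the laws on web content in the country where you live. In general, what being careful means is fairly obvious. Don't pass off other people's content as your own. Don't use content that has been declared not for sharing. Don't spam people or websites. Don't attack websites or crawl them maliciously. Most basically, don't be a jerk! If you couldn't tell your mother, or someone else close to you, what you are doing and feel good about it, don't do it.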
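There are several ways to be clear about what you are doing on the internet. Many scraping libraries let you send a User-Agent string. You can put your own information, or your company's, in that string so that it is clear who is doing the scraping. Also be sure to check the site's robots.txt file (http://www.robotstxt.org/robotstxt.html), which tells web crawlers which parts of the site are off-limits.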
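As a minimal sketch of the User-Agent advice above, the following uses Python 3's standard urllib rather than any particular scraping library. The identification string, the contact address, and the build_request helper are all hypothetical placeholders, not names from the book.

```python
from urllib.request import Request

# Hypothetical identification string (an assumption for illustration);
# replace it with your own name, project, and contact details.
USER_AGENT = "example-research-scraper/0.1 (contact: you@example.com)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request object that carries an identifying User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


# Build the request without sending it, so the sketch runs offline.
request = build_request("http://example.com/")

# urllib stores the header name in capitalized form ("User-agent").
print(request.get_header("User-agent"))
```

Passing the resulting object to urllib.request.urlopen would send the request with that header. Most scraping libraries accept an equivalent headers setting.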
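Before you build a crawler to traverse a site, check whether the parts of the site you are interested in are listed in the Disallow section of its robots.txt file. If they are, you will need to find another way to get the data, or contact the site's owner to see whether they can provide it to you some other way ...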
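One possible way to automate the Disallow check is Python 3's standard urllib.robotparser module, sketched below. The robots.txt content, the user agent name, the URLs, and the is_allowed helper are hypothetical. The rules are inlined so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for illustration),
# inlined so that the sketch does not need to contact a real site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A path outside the Disallow section versus a path inside it.
allowed_public = is_allowed(
    SAMPLE_ROBOTS_TXT, "example-research-scraper", "http://example.com/reports/"
)
allowed_private = is_allowed(
    SAMPLE_ROBOTS_TXT, "example-research-scraper", "http://example.com/private/data.html"
)
print(allowed_public, allowed_private)
```

Against a live site, the parser's set_url and read methods can load the real robots.txt file in place of the inlined text.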

Publisher Resources

ISBN: 9787115459190