Python文本分析

by Jens Albrecht, Sidharth Ramachandran, Christian Winkler
August 2022
Intermediate to advanced
441 pages
11h 26m
Chinese
China Electric Power Press Ltd.
Content preview from Python文本分析
Web Scraping and Data Extraction
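For more information, see the help pages of the search engines, for example Google's Programmable Search Engine page (https://oreil.ly/PWHOS).
Note that search engines are built primarily for interactive use. If you run too many (automated) searches, the engine will ask you to solve CAPTCHAs and will eventually block you from searching.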
3.4 URL Generation
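To download the content of the Reuters news archive, we first need to know the URLs of its pages. Once we have these URLs, the download itself is easy, because powerful Python tools can handle it.
Finding the URLs may look easy, but in practice it is not that simple. This process is called "URL generation," and in many crawling projects it is one of the hardest tasks. We have to make sure that we do not miss any URL, so it is essential to think the whole process through carefully from the start. Done properly, URL generation can also save a lot of time.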
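This excerpt does not show how the archive addresses are actually structured, so the following is only a minimal sketch of the idea. It assumes an archive whose pages are numbered through a page query parameter. The base address, the parameter name, and the function generate_archive_urls are hypothetical and do not describe the real Reuters site.

```python
from urllib.parse import urlencode

# Hypothetical archive address, used for illustration only.
# The real URL scheme of a site has to be worked out by inspecting it.
BASE_URL = "https://www.example.com/archive"


def generate_archive_urls(first_page, last_page):
    """Return one URL per archive page, in order and without gaps."""
    return [
        f"{BASE_URL}?{urlencode({'page': page})}"
        for page in range(first_page, last_page + 1)
    ]


urls = generate_archive_urls(1, 3)
for url in urls:
    print(url)
```

Building the complete list up front, before downloading anything, makes it straightforward to check that no page is missing or duplicated.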
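Before You Download
Be aware that downloading data is sometimes illegal. The specific rules and legal provisions depend on the country or region where the data is hosted and on the country or region into which you intend to download it. Websites usually have a page called "Terms of Use" or something similar, which you should read carefully.
If the data is downloaded only for temporary storage, the rules that apply to search engines apply to the download as well. Because search engines such as Google cannot read and understand the terms of use of every page, there is a very old protocol called the Robots Exclusion Standard (https://oreil.ly/IWysG). Websites that use this protocol place a file named robots.txt in their root directory. This file can be downloaded and interpreted automatically. If you are dealing with only a single website, reading and understanding the file by hand is also feasible. Our rule of thumb is: if there is no Disallow: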
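As a minimal sketch of the automatic interpretation mentioned above, Python's standard library provides urllib.robotparser. The robots.txt content and the example.com addresses below are made up for illustration. For a real site, the parser's set_url() and read() methods would fetch the file instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Create a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask whether a crawler with any user agent ("*") may fetch a given URL.
allowed = parser.can_fetch("*", "https://www.example.com/news/article-1")
blocked = parser.can_fetch("*", "https://www.example.com/private/data")
print(allowed, blocked)
```

Checking each generated URL with can_fetch() before downloading it is one way to respect the file's rules in code. It does not replace reading the site's terms of use.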
Publisher Resources

ISBN: 9787519864446