Python文本分析

by Jens Albrecht, Sidharth Ramachandran, Christian Winkler
August 2022
Intermediate to advanced
441 pages
11h 26m
Chinese
China Electric Power Press Ltd.
3.14 Case Study: Spidering
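So far, we have covered how to download web pages and how to extract their content with HTML parsing techniques. From a business perspective, however, what we usually need is not to process a single page but to get a view of the overall picture. For that, you need much more content.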
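Fortunately, combining what we already know is enough to download a content archive or an entire website. This usually happens in several stages: first you generate URLs, then you download the content, then you discover more URLs, and the cycle repeats.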
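
The generate, download, discover cycle described above can be sketched as a small queue-driven loop. This is only a minimal illustration, not the book's implementation: the function name `crawl` and the parameters `fetch`, `discover`, and `max_pages` are hypothetical, and the actual download and link-extraction logic is passed in by the caller.

```python
from collections import deque
from typing import Callable, Dict, Iterable


def crawl(seed_urls: Iterable[str],
          fetch: Callable[[str], str],
          discover: Callable[[str], Iterable[str]],
          max_pages: int = 100) -> Dict[str, str]:
    """Hypothetical helper: download pages breadth-first, starting from seed_urls.

    fetch(url) returns the content of one page; discover(content) returns
    the URLs found in that content. Both are supplied by the caller.
    """
    queue = deque()
    seen = set()
    for url in seed_urls:            # stage 1: generate the initial URLs
        if url not in seen:
            seen.add(url)
            queue.append(url)

    pages: Dict[str, str] = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        content = fetch(url)         # stage 2: download the content
        pages[url] = content
        for new_url in discover(content):   # stage 3: discover more URLs
            if new_url not in seen:  # never queue the same URL twice
                seen.add(new_url)
                queue.append(new_url)
    return pages
```

The `seen` set keeps the loop from revisiting pages that link to each other, and `max_pages` puts an upper bound on how much is downloaded in one run.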
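In this section, we walk through a spidering example in detail and build a scalable solution that you can use to download thousands (or even millions) of web pages.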
3.14.1 Introducing the Case Study
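Parsing a single Reuters article is a nice exercise, but the Reuters archive is very large and contains a great many articles. The techniques introduced earlier still apply when parsing data at that scale. Suppose you need to download and extract a large forum full of user-generated content, or a website of scientific articles. As mentioned before, the hard part is usually finding the URLs of the individual articles.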
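Our case is different, however. Although we could use sitemap.xml, Reuters is generous enough to provide a dedicated archive page at https://www.reuters.com/news/archive. The archive also supports pagination, so you can go back to older content.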
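
Because the archive is paginated, the URLs of the archive pages can be generated rather than discovered. The sketch below shows one way to do this. The base URL comes from the text, but the query parameter name `page` is an assumption (the text does not say how the archive addresses its pages), and `archive_page_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Archive address as given in the text.
ARCHIVE_URL = "https://www.reuters.com/news/archive"


def archive_page_url(page: int, page_param: str = "page") -> str:
    """Hypothetical helper: build the URL of one archive page.

    page_param is an ASSUMED query parameter name; check the real site
    for how pagination is actually expressed.
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    return f"{ARCHIVE_URL}?{urlencode({page_param: page})}"


# URLs for the first three archive pages
urls = [archive_page_url(n) for n in range(1, 4)]
```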
Figure 3-5 shows the steps for downloading part of the archive (also known as spidering). The process is as follows:
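1. Define how many pages of the archive should be downloaded.
2. Download the archive pages and name them page-000001.html, page-000002.html, and so on, so that they are easy to inspect. If a file already exists, skip this step.
3. For each page-*.html file, extract the referenced articles' ...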
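
Steps 1 and 2 above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the book's code: `page_filename` and `download_archive` are hypothetical names, and the actual HTTP request is left to a caller-supplied `fetch_page` function so the file-naming and skip logic can be seen in isolation.

```python
from pathlib import Path
from typing import Callable, List


def page_filename(page: int) -> str:
    """File name for one archive page, e.g. page-000001.html."""
    return f"page-{page:06d}.html"


def download_archive(num_pages: int,
                     fetch_page: Callable[[int], str],
                     target_dir: Path) -> List[Path]:
    """Hypothetical helper: store archive pages 1..num_pages in target_dir.

    fetch_page(page) must return the HTML of that archive page; how it
    does so (for example with an HTTP library) is up to the caller.
    Returns the paths of the files written during this run.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for page in range(1, num_pages + 1):   # step 1: fixed number of pages
        path = target_dir / page_filename(page)
        if path.exists():                  # step 2: skip existing files
            continue
        path.write_text(fetch_page(page), encoding="utf-8")
        written.append(path)
    return written
```

Skipping files that already exist makes the download resumable: if a run is interrupted, starting it again only fetches the pages that are still missing.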
ISBN: 9787519864446