Python文本分析

by Jens Albrecht, Sidharth Ramachandran, Christian Winkler
August 2022
Intermediate to advanced
441 pages
11h 26m
Chinese
China Electric Power Press Ltd.
Content preview from Python文本分析
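... because you will end up re-downloading more often than you might expect. Not having to download everything again is a huge advantage, especially during development.
Once you have downloaded and extracted the data, you should store it persistently for later use. A simple approach is to save the data to JSON files. If there are many files, consider using a directory structure. As the number of pages grows, you will find that this approach does not scale well, and at that point you should consider a database or another columnar data store.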
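
As a minimal sketch of the JSON-file approach described above, the following stores one extracted page per file and spreads the files over subdirectories so that no single directory grows too large. The helper names (`record_path`, `save_record`, `load_record`), the hash-based directory layout, and the record fields are assumptions made for illustration; they are not taken from the book.

```python
import hashlib
import json
from pathlib import Path


def record_path(base_dir, url):
    """Map a URL to a file path such as <base_dir>/ab/ab12...ef.json.

    Hypothetical layout: the first two hex characters of the URL's
    SHA-256 digest name the subdirectory.
    """
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(base_dir) / digest[:2] / f"{digest}.json"


def save_record(base_dir, url, record):
    """Write the extracted data for one page to its own JSON file."""
    path = record_path(base_dir, url)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII text readable in the file
        json.dump({"url": url, **record}, f, ensure_ascii=False, indent=2)
    return path


def load_record(base_dir, url):
    """Return the stored data for a URL, or None if it was never saved."""
    path = record_path(base_dir, url)
    if not path.exists():
        return None
    with path.open(encoding="utf-8") as f:
        return json.load(f)
```

Checking `load_record` before fetching a page is one way to avoid downloading it again during development.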
3.15 Density-Based Text Extraction
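Extracting structured data from HTML is not complicated, but it is tedious. If you want to extract data from an entire website, the approach described above is a good choice, because you only need to implement extraction for a limited number of page types.
Sometimes, however, you need to extract text from many different websites. Implementing extraction separately for each site does not scale. Some metadata, such as the title and description, is easy to find, but locating the text itself is not.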
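Instead, you need to look at the density of information and extract the text with some heuristics. The algorithm behind this approach measures information density, so it automatically eliminates repetitive information such as headers, navigation, and footers. The implementation is not simple, but a ready-made library, python-readability (https://oreil.ly/AemZh), is available. The name comes from a browser plugin, Readability, which is no longer maintained. The plugin was designed to remove clutter from web pages and make them more readable, which is exactly the functionality we need here. First, we need to install python-readability (pip install readability-lxml).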
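
To make the idea of information density concrete, here is a deliberately simplified sketch that uses only the Python standard library. It is not the algorithm implemented by python-readability. It assumes that the main content sits in blocks with a lot of plain text and few links, while navigation and footers are short or link-heavy. The names `BlockCollector` and `extract_main_text`, the set of block tags, and the thresholds `min_chars` and `max_link_density` are hypothetical choices for illustration.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumption for this sketch).
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3",
              "nav", "header", "footer", "article", "section"}


def _normalize(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


class BlockCollector(HTMLParser):
    """Collect the text of each block and how much of it is link text."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # finished blocks as (text, link_chars)
        self._parts = []        # text pieces of the block being read
        self._link_chars = 0    # characters of the current block inside <a>
        self._in_link = 0
        self._skip = 0          # > 0 while inside <script> or <style>

    def _flush(self):
        text = _normalize("".join(self._parts))
        if text:
            self.blocks.append((text, self._link_chars))
        self._parts = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._in_link += 1
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)

    def handle_data(self, data):
        if self._skip:
            return
        self._parts.append(data)
        if self._in_link:
            self._link_chars += len(_normalize(data))

    def close(self):
        super().close()
        self._flush()


def extract_main_text(html, min_chars=40, max_link_density=0.3):
    """Keep blocks that are long enough and mostly plain (non-link) text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    kept = []
    for text, link_chars in parser.blocks:
        link_density = min(1.0, link_chars / len(text))
        if len(text) >= min_chars and link_density <= max_link_density:
            kept.append(text)
    return "\n".join(kept)
```

With these thresholds, a menu made of short link-only list items is discarded, while a paragraph containing a single inline link is kept. Real pages need considerably more robust scoring, which is why a maintained library is preferable in practice.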
3.15.1 Reading Reuters Content with Readability
Below, we use Reuters as an example to see how to use ...
Publisher Resources

ISBN: 9787519864446