Python文本分析

by Jens Albrecht, Sidharth Ramachandran, Christian Winkler
August 2022
Intermediate to advanced
441 pages
11h 26m
Chinese
China Electric Power Press Ltd.
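Before encrypted transmission begins, a symmetric session key is created (Diffie-Hellman key exchange, https://oreil.ly/rOzKH).
Browsers are smart about this and keep a special cache for the session key. When you look for a dedicated download program, choose one that also supports this kind of caching. To reduce latency further, TCP connections can be reused through HTTP persistent connections. The Python requests library supports this through its Session abstraction.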
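The connection reuse described above can be sketched as follows. This is a minimal illustration, not the book's code: the helper names `make_session` and `fetch_pages`, the User-Agent string, and the timeout value are assumptions chosen for the example.

```python
from typing import Dict, Iterable


def make_session(user_agent: str = "example-downloader/0.1"):
    """Create a requests.Session; the User-Agent value is a placeholder."""
    import requests  # third-party package: pip install requests

    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_pages(session, urls: Iterable[str], timeout: float = 10.0) -> Dict[str, str]:
    """Download every URL through the same session object.

    Requests sent through one Session to the same host can reuse the
    pooled connection instead of opening a new one for each page.
    """
    pages = {}
    for url in urls:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()  # raise an exception on 4xx/5xx responses
        pages[url] = response.text
    return pages
```

A typical call would be `with make_session() as s: pages = fetch_pages(s, urls)`, which also closes the pooled connections when the block ends.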
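Save files
In many projects, it is worth saving the downloaded HTML pages (temporarily) to the file system. You can of course extract the structured content on the fly, but if something goes wrong or the page structure differs from what you expected, the problem is hard to investigate and debug. Saved files are very helpful, especially during development.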
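One possible way to keep downloaded pages on disk is sketched below. The naming scheme (last path segment plus a short hash of the URL), the `downloads` directory, and the function names are assumptions for illustration; the text does not prescribe any of them.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse


def url_to_filename(url: str) -> str:
    """Derive a stable, filesystem-safe file name from a URL."""
    parsed = urlparse(url)
    # The last path segment keeps the name readable; "index" covers bare hosts.
    stem = Path(parsed.path).name or "index"
    safe_stem = "".join(c if c.isalnum() or c in "-_" else "_" for c in stem)
    # A short hash of the full URL keeps names unique across query strings.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:10]
    return f"{safe_stem}-{digest}.html"


def save_page(html: str, url: str, directory: str = "downloads") -> Path:
    """Write the HTML to <directory>/<derived name> and return the path."""
    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)
    path = target_dir / url_to_filename(url)
    path.write_text(html, encoding="utf-8")
    return path
```

Because the file name depends only on the URL, a later extraction step can find the stored page again without a separate index.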
Start with the simple approach
In most cases, the requests library is enough to download pages. It offers an excellent interface and runs in a Python environment.
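Avoid getting banned
Most websites do not like being scraped, and a large share of them have taken countermeasures. If you need to download many pages, be polite and add a waiting period between two requests.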
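The waiting period between requests can be sketched like this. The function name `download_politely`, the one-second default, and the injectable `fetch` and `sleep` callables are assumptions made to keep the example self-contained; a suitable delay depends on the site.

```python
import time
from typing import Callable, Dict, Iterable


def download_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch each URL in turn, pausing between consecutive requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait between requests, not before the first
        pages[url] = fetch(url)
    return pages
```

With requests, `fetch` could be something like `lambda url: session.get(url, timeout=10).text`, where `session` is a `requests.Session`.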
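If you still get banned, check the content and the response codes carefully. You can change your IP address or use IPv6, a proxy server, a VPN, or even the Tor network.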
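Checking the response code and content might look like the sketch below. The set of status codes and the text markers are heuristics assumed for illustration; real block pages differ from site to site. The `proxy_settings` helper is also hypothetical and only builds the dictionary that requests accepts through its `proxies` argument.

```python
from typing import Dict, Optional

# Status codes often returned when a site rejects or throttles a client.
BLOCK_STATUS_CODES = {403, 429, 503}

# Hypothetical markers; adapt them to the block pages you actually see.
BLOCK_MARKERS = ("captcha", "access denied", "too many requests")


def looks_blocked(status_code: int, body: str) -> bool:
    """Heuristically decide whether a response is a block page."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)


def proxy_settings(proxy_url: Optional[str]) -> Dict[str, str]:
    """Map one proxy URL to both schemes, or return no proxies at all."""
    if not proxy_url:
        return {}
    return {"http": proxy_url, "https": proxy_url}
```

A download loop could call `looks_blocked(response.status_code, response.text)` after each request and stop or slow down when it returns `True`, instead of silently saving block pages.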
Legal aspects
Depending on where you live and on the website's terms of use, scraping may be prohibited altogether.
3.9 Case study: Downloading HTML pages with Python
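To download HTML pages, you first need to know their URLs. As mentioned above, the URLs are all contained in the sitemap. Below, we use the list provided by the sitemap to download the page contents: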
%%time
s = requests.Session() ...
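The listing above is cut off in this preview, and the following is not a reconstruction of it. As an independent sketch of the step the text describes, the URL list could be obtained from a sitemap roughly like this, assuming the sitemap follows the standard sitemaps.org XML format; the function name `urls_from_sitemap` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List

# XML namespace used by standard sitemap files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(sitemap_xml: str) -> List[str]:
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        if loc.text:
            urls.append(loc.text.strip())
    return urls
```

The resulting list can then be passed, one URL at a time, to `s.get(...)` on the session created in the listing.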

Publisher Resources

ISBN: 9787519864446