Python数据处理

by Jacqueline Kazil, Katharine Jarmul
July 2017
Intermediate to advanced
398 pages
11h 54m
Chinese
Posts & Telecom Press
Web Scraping: Acquiring and Storing Data from the Web
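On many sites, the top of the page holds navigation and links to the site's main sections or related topics. Links or ads usually run down the sides of the page. The middle of the page usually contains the content you want to scrape.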
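Becoming familiar with the structure of most web pages (where elements sit visually and where they sit in the markup) will help you scrape data from the internet. If you can zero in on where the data lives, you can build scrapers quickly.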
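Once you know what to look for on the page, and have analyzed how the page is put together by studying the structure of its source, you can decide how to collect the important parts. Many web pages deliver their content on the first page load, or serve a cached page with the content already loaded. For these pages, you can use a simple XML or HTML parser (we will learn about them in this chapter) and read the content directly from the first HTTP response (what the browser loads when you request a URL). This is similar to reading a document, except that it needs an initial page request.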
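As a minimal sketch of the page-reader idea, the example below parses an HTML string and keeps only the text in the middle content area, skipping the navigation and sidebar. It uses Python's standard-library `html.parser` as a stand-in, not necessarily the parser the chapter goes on to teach. The sample markup, the `id="content"` convention, and the names `ContentExtractor` and `extract_content` are all assumptions made for illustration; on a real site the string would come from the first HTTP response.

```python
from html.parser import HTMLParser

# Hypothetical page: navigation at the top, an ad in a sidebar,
# and the content we want in the middle.
SAMPLE_HTML = """
<html><body>
  <nav><a href="/home">Home</a> <a href="/about">About</a></nav>
  <aside>Buy our product!</aside>
  <div id="content">
    <h1>Rainfall report</h1>
    <p>March total: 42 mm</p>
  </div>
</body></html>
"""


class ContentExtractor(HTMLParser):
    """Collect text found inside the element with id="content".

    Sketch only: it assumes every tag inside the target element is
    explicitly closed, so unclosed void tags such as <br> would
    throw off the depth count.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside the target element
        self.chunks = []  # non-empty text pieces, in document order

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == "content":
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth and text:
            self.chunks.append(text)


def extract_content(html):
    """Return the text pieces inside the content element."""
    parser = ContentExtractor()
    parser.feed(html)
    return parser.chunks


print(extract_content(SAMPLE_HTML))
```

The navigation links and the sidebar ad are ignored because they sit outside the target element, which mirrors the typical page layout described earlier.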
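If you first need to interact with the page to get the data (that is, enter data and click buttons), and the interaction is more than a simple change of URL, you need a browser-based scraper that opens the page in a browser and interacts with it.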
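If you need to traverse an entire site to collect data, you will want a spider: a robot that crawls pages and, following rules, identifies good content or more pages to follow. The library we use for crawling is very fast and flexible, and makes writing these kinds of scripts easy.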
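The crawl-and-follow-rules loop can be sketched in a few lines. This is not the crawling library the book refers to. It is a standard-library illustration that walks a hypothetical site held in a dictionary, so it runs without network access. The `SITE` contents, the `should_follow` rule, and the function names are assumptions for the example.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical site held in memory: path -> HTML. A real spider would
# request each URL over HTTP instead of reading from this dict.
SITE = {
    "/": '<a href="/reports">Reports</a> <a href="/about">About</a>',
    "/reports": '<a href="/reports/2016">2016</a> <a href="/">Home</a>',
    "/reports/2016": "<p>Annual data table</p>",
    "/about": '<a href="/">Home</a>',
}


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def should_follow(path):
    """Rule assumed for this sketch: stay on the home page and under /reports."""
    return path == "/" or path.startswith("/reports")


def crawl(start):
    """Breadth-first crawl from start; return the pages visited, in order."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue:
        path = queue.popleft()
        visited.append(path)
        collector = LinkCollector()
        collector.feed(SITE.get(path, ""))
        for link in collector.links:
            # Skip pages already queued and pages the rule excludes.
            if link not in seen and should_follow(link):
                seen.add(link)
                queue.append(link)
    return visited


print(crawl("/"))
```

The `seen` set keeps the spider from looping when pages link back to each other, and `should_follow` is where the rules for which pages to follow would live.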
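Before we start writing scraper code, we will look at a few websites and get used to analyzing which type of scraper to use (page reader, browser reader, or spider) and how hard or easy scraping the data will be. Sometimes it is important to decide how much effort the data is worth. We will introduce some tools for determining how much effort scraping the data requires ...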
Publisher Resources

ISBN: 9787115459190