Python数据处理

by Jacqueline Kazil, Katharine Jarmul
July 2017
Intermediate to advanced
398 pages
11h 54m
Chinese
Posts & Telecom Press
Content preview from Python数据处理
Web Scraping: Acquiring and Storing Data from the Web
The XPath div/span returns every span element under each child div element.
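To make the div/span relationship concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for whatever parser the scraper uses, and the list item markup and image path are hypothetical, modeled only on what the text describes.

```python
import xml.etree.ElementTree as ET

# Hypothetical list item: one child div holding two spans.
SAMPLE_LI = (
    '<li><div>'
    '<span data-src="graphics/smile.png"></span>'
    '<span>smile</span>'
    '</div></li>'
)

li = ET.fromstring(SAMPLE_LI)

# "div/span" is evaluated relative to li: it matches every span
# that is a direct child of a div that is a direct child of li.
spans = li.findall("div/span")

for span in spans:
    print(span.tag, span.attrib, span.text)
```

Because the path is relative, the same expression can be reused on every list item in a loop without matching spans elsewhere on the page.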
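To find the link for each element, this line reads the data-src attribute of the first span. If the link variable is None, the code sets the emoji_link attribute in our data dictionary to None.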
Because data-src holds a relative URL, this line uses the base_url attribute to build a complete absolute URL.
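The link handling can be sketched as follows. This is an illustration under assumptions, not the chapter's code: it swaps in urllib.parse.urljoin for the element's base_url attribute, and BASE_URL, absolute_link, and the sample markup are hypothetical names invented for the example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

BASE_URL = "http://example.com/cheat-sheet/"  # hypothetical page address


def absolute_link(span, base_url=BASE_URL):
    """Return the absolute URL built from a span's data-src, or None."""
    link = span.get("data-src")  # .get() returns None if the attribute is missing
    if link is None:
        return None
    # Resolve the relative path against the page's base URL.
    return urljoin(base_url, link)


with_src = ET.fromstring('<span data-src="graphics/smile.png"></span>')
without_src = ET.fromstring('<span></span>')

print(absolute_link(with_src))
print(absolute_link(without_src))
```

Checking for None before building the URL keeps a missing attribute from turning into a broken link in the data dictionary.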
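To get the handle, or the text needed to call up the emoji, this line grabs the text of the second span. Unlike the link logic, we don't need to test whether it exists, because every emoji has a handle.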
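For the part of the page that includes the Basecamp sounds, there is one div for each list item (you can easily find it by inspecting the page with your browser's developer tools). This line selects the div and grabs its text content. Because this line is in the else block, we know these are only sound files, since they don't use spans.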
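The two branches described above can be sketched together. This is a hedged illustration using the standard library's xml.etree.ElementTree rather than the chapter's parser. The markup, the parse_item helper, and the base URL are all hypothetical. The emoji_link key follows the text, while emoji_handle is an assumed key name.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical markup: an emoji item (two spans) and a sound item (plain div).
SAMPLE_UL = (
    '<ul>'
    '<li><div><span data-src="graphics/smile.png"></span>'
    '<span>smile</span></div></li>'
    '<li><div>bell</div></li>'
    '</ul>'
)


def parse_item(li, base_url):
    """Build one data dictionary from a list item."""
    spans = li.findall("div/span")
    if spans:
        # Emoji: link from the first span, handle from the second.
        link = spans[0].get("data-src")
        emoji_link = urljoin(base_url, link) if link else None
        # No existence test here: the text says every emoji has a handle.
        handle = "".join(spans[1].itertext()).strip()
    else:
        # Sound: no spans, only a div whose text is the handle.
        emoji_link = None
        handle = "".join(li.find("div").itertext()).strip()
    return {"emoji_link": emoji_link, "emoji_handle": handle}


items = ET.fromstring(SAMPLE_UL).findall("li")
results = [parse_item(li, "http://example.com/") for li in items]
print(results)
```

Testing whether the span list is empty is what separates the two kinds of entries, so no separate marker for sounds is needed.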
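By rewriting the emoji code to use XPath relationships, we found that the last block of tags holds sounds and that the data in them is stored differently. Instead of a link kept in a span, there is only a div containing the text that calls up the sound. If we only wanted emoji links, we could skip adding the sounds to the list during iteration. Depending on what data you are interested in, your code will vary greatly, but you can always easily use if...else logic to specify the content you need.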
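Skipping the sounds during iteration might look like the sketch below. The records are hypothetical stand-ins for already-parsed items, and the emoji_handle key is an assumed name.

```python
# Hypothetical parsed records; the sound entry carries no link.
records = [
    {"emoji_handle": "smile", "emoji_link": "http://example.com/graphics/smile.png"},
    {"emoji_handle": "bell", "emoji_link": None},
]

emoji_only = []
for record in records:
    if record["emoji_link"] is None:
        continue  # skip entries without a link, such as sounds
    emoji_only.append(record)

print(emoji_only)
```

Note that this filter would also drop any emoji whose data-src attribute was missing, since those records carry None as well.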
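In fewer than 30 lines of code, we created a scraper that requests the page, parses it by traversing DOM relationships with XPath, and grabs the content we need using the appropriate attribute or text content. This code extends well ...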

Publisher Resources

ISBN: 9787115459190