Python数据处理

by Jacqueline Kazil, Katharine Jarmul
July 2017
Intermediate to advanced
398 pages
11h 54m
Chinese
Posts & Telecom Press
Data Cleanup: Investigation, Matching, and Formatting
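There are many ways to create a unique key. We could use the interview start time, but we are not sure whether UNICEF deployed several survey teams at the same time. If it did, we might delete entries as duplicates that are not actually duplicates. We could combine the respondent's birth date with the interview time, which is unlikely to produce duplicates, but that becomes a problem if either field is missing.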
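To illustrate the missing-field concern, here is a minimal sketch on made-up rows. The field names `birth_date` and `interview_time`, the helper `make_key`, and all values are hypothetical and not taken from the UNICEF data; the point is only that two different people with missing fields collapse to the same key.

```python
# Hypothetical rows; field names and values are invented for illustration.
rows = [
    {'birth_date': '1980-01-01', 'interview_time': '09:00'},
    {'birth_date': None, 'interview_time': None},  # person A, fields missing
    {'birth_date': None, 'interview_time': None},  # person B, fields missing
]


def make_key(row):
    """Join the two fields into one string key (hypothetical helper)."""
    return '%s-%s' % (row['birth_date'], row['interview_time'])


keys = [make_key(row) for row in rows]
distinct_keys = set(keys)

# Three rows but only two distinct keys: the two incomplete rows collide
# and would wrongly be treated as duplicates of each other.
print(len(rows), len(distinct_keys))
```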
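An elegant solution is to check whether the cluster number, household number, and household member number together form a unique key. If they do, we can apply this approach to the entire dataset, even where there is no interview start or end time. Let's try it!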
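# Build the set of all 'cluster-household-member' keys.
set_of_keys = set([
    '%s-%s-%s' % (x[0][1], x[1][1], x[2][1])
    for x in zipped_data])

# set.remove returns None, so every row is kept; it raises KeyError
# if a key was already removed, that is, if a row is duplicated.
uniques = [x for x in zipped_data
           if not set_of_keys.remove(
               '%s-%s-%s' % (x[0][1], x[1][1], x[2][1]))]

print(len(set_of_keys))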
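Builds a string from the cluster number, household number, and household member number, a combination we believe to be unique. We separate the three numbers with - so they are easy to tell apart.

Uses the remove method to re-create each unique key and take it out of the set. The keys are removed one by one, and the uniques list ends up containing every unique row. If there is a duplicate row, its key has already been removed, so the code raises an error (a KeyError).

Computes the length of the set of keys. The aim is to learn how many unique entries the dataset has. Because remove has already taken each key out of set_of_keys as its row was processed, this prints 0 when every key matched exactly one row; the number of unique rows is len(uniques).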
Great! This time there were no errors. From the length of the list we can see ...
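The behavior described above can be checked on toy data. The sketch below assumes, based on the indexing `x[0][1]`, that each row of `zipped_data` is a list of (header, value) pairs whose first three pairs hold the cluster, household, and household member numbers. The header strings and values are invented, and `key_for` and `find_uniques` are hypothetical helper names, not part of the original code.

```python
def key_for(row):
    """Build the 'cluster-household-member' key from the first three pairs."""
    return '%s-%s-%s' % (row[0][1], row[1][1], row[2][1])


def find_uniques(zipped_data):
    """Return (rows, leftover keys); raise KeyError if a key occurs twice."""
    remaining = set(key_for(row) for row in zipped_data)
    uniques = []
    for row in zipped_data:
        # set.remove raises KeyError when the key was already removed,
        # which is what happens on the second copy of a duplicated row.
        remaining.remove(key_for(row))
        uniques.append(row)
    return uniques, remaining


# Invented rows shaped like lists of (header, value) pairs.
clean = [
    [('Cluster number', '1'), ('Household number', '17'), ('Line number', '1')],
    [('Cluster number', '1'), ('Household number', '17'), ('Line number', '2')],
    [('Cluster number', '2'), ('Household number', '3'), ('Line number', '1')],
]

uniques, remaining = find_uniques(clean)
print(len(uniques), len(remaining))  # every row kept, no keys left over

# Appending a copy of the first row makes its second removal fail.
try:
    find_uniques(clean + [clean[0]])
    duplicate_detected = False
except KeyError:
    duplicate_detected = True
print(duplicate_detected)
```

If the goal is to list the repeated keys rather than stop at the first one, counting keys with `collections.Counter` is a common alternative.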
Publisher Resources

ISBN: 9787115459190