Python数据处理
by Jacqueline Kazil, Katharine Jarmul
July 2017
Intermediate to advanced
398 pages
11h 54m
Chinese
Posts & Telecom Press
As you continue working with the dataset, you will also find data-type outliers or NA responses.
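The best way to handle this inconsistent data depends on how well you know the topic and the dataset, and on the questions you want to answer. If you are merging datasets, you can sometimes discard outliers and bad data, but take care not to overlook minor trends.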
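Now that we have a first picture of the outliers in our dataset and the patterns they follow, let's move on to removing another kind of bad data: duplicates, which we may even create ourselves.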
7.2.4 Finding Duplicates
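If you are working with several datasets from the same survey, or with raw data that may contain duplicates, removing duplicate data is an important step toward making sure your data is accurate and usable. If your dataset has unique identifiers, you can use those IDs to make sure you have not accidentally inserted or acquired duplicate data. If your dataset has no index, you may need to find a good way to determine uniqueness (for example, by creating an indexable key).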
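
The "indexable key" idea can be sketched as follows. This is only an illustration, not the book's own code: it assumes each record is a dictionary and that a combination of a few fields identifies a row. The records, the field names (`country`, `year`, `value`), and the helpers `make_key` and `drop_duplicates` are all hypothetical.

```python
# Hypothetical records; the field names are illustrative only.
records = [
    {"country": "A", "year": 2010, "value": 1.5},
    {"country": "B", "year": 2010, "value": 2.0},
    {"country": "A", "year": 2010, "value": 1.5},  # same key as the first row
]


def make_key(record, fields=("country", "year")):
    """Build a hashable tuple from the fields assumed to identify a row."""
    return tuple(record[field] for field in fields)


def drop_duplicates(rows, fields=("country", "year")):
    """Keep the first row seen for each key, preserving input order."""
    rows_by_key = {}
    for row in rows:
        key = make_key(row, fields)
        if key not in rows_by_key:
            rows_by_key[key] = row
    return list(rows_by_key.values())


unique_records = drop_duplicates(records)
print(len(records), len(unique_records))
```

Which fields belong in the key is a judgment about the data: too few fields and distinct rows collapse into one, too many and true duplicates slip through.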
Python's built-in library offers several good ways to determine uniqueness. Let's start by introducing some concepts:
list_with_dupes = [1, 5, 6, 2, 5, 6, 8, 3, 8, 3, 3, 7, 9]
set_without_dupes = set(list_with_dupes)
print(set_without_dupes)
Your output should look like this:
{1, 2, 3, 5, 6, 7, 8, 9}
What happened here? set and frozenset are both Python built-in types. Each takes an iterable (such as a list, string, or tuple) and returns a set containing only the unique values.
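
As a small sketch of that behavior (the variable names here are illustrative, not from the book), the snippet below builds sets from a list, a string, and a tuple, and shows that a frozenset holds the same unique values but cannot be modified in place:

```python
# set() accepts any iterable and keeps one copy of each value.
from_list = set([1, 5, 5, 6])
from_string = set("hello")      # iterating a string yields its characters
from_tuple = set((3, 3, 7))

# frozenset is the immutable counterpart of set.
frozen = frozenset([1, 5, 5, 6])

print(from_list == {1, 5, 6})
print(sorted(from_string))      # sorted() gives a stable order for display
print(from_tuple == {3, 7})
print(frozen == from_list)      # equal contents compare equal

from_list.add(9)                # a set can be changed in place
print(hasattr(frozen, "add"))   # a frozenset has no add() method
```

Sets are unordered, so the example compares against set literals or sorts the values rather than relying on the order in which they print.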
To use set and frozenset, the values you pass in need to be ...
Publisher Resources

ISBN: 9787115459190