Python数据处理

by Jacqueline Kazil, Katharine Jarmul
July 2017
Intermediate to advanced
398 pages
11h 54m
Chinese
Posts & Telecom Press
Content preview from Python数据处理
Chapter 8. Data Cleanup: Standardizing and Scripting
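You have learned how to match and parse data and how to find duplicates, and you have begun to explore the wonderful world of data cleanup. As you come to understand your datasets, and the questions you want to answer with them, you will need to think about standardizing your data and automating your cleanup.

In this chapter, we explore how and when to standardize your data, and when to script your data cleanup and test those scripts. If you manage datasets that are regularly updated or added to, you will want to make the cleanup process as efficient and clear as possible, so that you can spend more time analyzing the data and writing reports. We begin with standardizing and normalizing your dataset, and with what to do if your dataset is not normalized.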
8.1 Normalizing and Standardizing Your Data
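Standardizing and normalizing a dataset may mean calculating new values from the values you currently have, or it may mean applying standardization or normalization to a particular column or value. Which one applies depends on your data and the type of research you are doing.

From a statistical point of view, normalization usually requires calculating values from your dataset so that the data all fall within a particular range. For example, you might need to normalize test scores to a given range so you can accurately see the score distribution. You might also need to normalize data so you can accurately see percentiles, or percentiles across different groups (or cohorts).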
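As a rough illustration of comparing percentiles across cohorts, the sketch below computes a percentile rank for the same raw score in two groups. The function name `percentile_rank`, the cohort labels, and the scores are all hypothetical. The definition used here (the share of values at or below the given score) is only one common convention, not necessarily the one the book adopts.

```python
def percentile_rank(scores, score):
    """Return the percentage of values in `scores` that are <= `score`.

    This is one common definition of percentile rank; others exist.
    """
    at_or_below = sum(1 for s in scores if s <= score)
    return 100.0 * at_or_below / len(scores)


# Hypothetical test scores for two cohorts (illustrative data only).
cohorts = {
    "2015": [55, 60, 70, 80, 90],
    "2016": [65, 75, 85, 90, 95],
}

# Where does a raw score of 80 fall within each cohort?
ranks = {year: percentile_rank(scores, 80) for year, scores in cohorts.items()}

for year, rank in ranks.items():
    print(year, rank)
```

The same raw score lands at a different percentile in each cohort, which is why raw values alone can mislead when groups are compared.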
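Suppose you want to see the distribution of a team's scores over a given season. You might first sort the games into wins, losses, and ties. You might then break those down further by how many points the team won or lost by, and so on. You might also categorize games by their length and by points scored per minute. You have access to all of these datasets, and now you want to compare across teams. To normalize the data, you might rescale the total scores to the range 0–1. The outlier (the highest score) would be close to 1, and the lower scores would be close to 0. You could then use the distribution of the new data to see how many teams' scores fall in the middle ...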
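One way to rescale totals to the 0–1 range is min-max normalization, sketched below. This is an assumption about the technique: the preview does not name a specific formula. The function name `min_max_normalize`, the team names, and the point totals are hypothetical.

```python
def min_max_normalize(values):
    """Rescale numbers linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# Hypothetical season point totals per team (illustrative data only).
season_totals = {"Team A": 50, "Team B": 75, "Team C": 100, "Team D": 150}

# Pair each team with its normalized total (dicts preserve insertion order).
normalized = dict(
    zip(season_totals, min_max_normalize(list(season_totals.values())))
)

for team, value in normalized.items():
    print(team, value)
```

With this formula the highest total maps to exactly 1 and the lowest to exactly 0. A single extreme outlier therefore compresses every other value toward 0, which is worth keeping in mind when reading the resulting distribution.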
ISBN: 9787115459190