Python数据处理 (Data Wrangling with Python)

by Jacqueline Kazil, Katharine Jarmul
July 2017
Intermediate to advanced
398 pages
11h 54m
Chinese
Posts & Telecom Press
Chapter 7. Data Cleanup: Investigation, Matching, and Formatting
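Data cleanup is not the most glamorous work, but it is an essential part of data wrangling. Becoming a data cleanup expert takes rigor and thorough, systematic knowledge of your field of study. Learning how to clean and assemble data properly can make you stand out in that field.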
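Python is well designed for data cleanup: you can write functions that handle recurring patterns and cut down on repetitive work. Building on the code we have learned so far, scripts that solve repetitive problems can turn hours of manual labor into a script you run once.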
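In this chapter we will learn how to clean and format data with Python. We will also use Python to find duplicates and errors in datasets. In the next chapter we continue with data cleanup, focusing on automating the cleanup process and storing the cleaned data.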
7.1 Why Clean Data?
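Some of the data you obtain may be well formatted and ready to use. If so, consider yourself lucky! Most data, even after it has been cleaned, still has inconsistent formatting and readability problems, such as acronyms or mismatched descriptive headers, especially when the data comes from more than one dataset. Unless you spend time formatting and standardizing the data, it will not merge correctly and will therefore be of little use.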
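As a rough illustration of the header problem described above, the following sketch normalizes column headers from two sources so that their rows share the same keys before merging. The sample rows, the `HEADER_ALIASES` mapping, and the function names are all hypothetical and chosen only for this example; they do not come from the book.

```python
import re

# Hypothetical mapping from abbreviated headers to descriptive titles.
# A real project would build this from its own datasets.
HEADER_ALIASES = {
    "dob": "date_of_birth",
    "tel": "phone_number",
}


def normalize_header(header):
    """Lower-case a header, turn runs of other characters into underscores,
    then expand any known abbreviation."""
    key = re.sub(r"[^a-z0-9]+", "_", header.strip().lower()).strip("_")
    return HEADER_ALIASES.get(key, key)


def normalize_row(row):
    """Return a copy of the row with every header normalized."""
    return {normalize_header(name): value for name, value in row.items()}


# Two made-up rows describing the same person with mismatched headers.
first = {"DOB": "1990-01-31", "Tel.": "555-0100"}
second = {"Date of Birth": "1990-01-31", "phone number": "555-0100"}

print(normalize_row(first))
print(normalize_row(second))
```

Once both rows use the same keys, they can be compared or merged directly. An alias table like this only covers abbreviations you already know about, so unexpected headers still need a manual look.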
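Cleaning your data makes it easier to store, search, and reuse. As we learned in Chapter 6, it is much easier to save data into a proper model if you clean it first. Imagine a dataset with many columns (or fields) that should each be stored as a particular data type, such as dates, numbers, or email addresses. If you can standardize the expected format and clean or remove records that do not fit, you ensure consistency in your data and save yourself a lot of work when you need to query the dataset later.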
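One way to picture standardizing typed columns is the minimal sketch below, which converts dates to a single format, tidies email addresses, and drops records that fit neither. The field names (`signup_date`, `email`), the list of accepted date formats, and the sample records are assumptions made for illustration, and the email pattern is a deliberately loose check rather than a full validator.

```python
import re
from datetime import datetime

# Assumed input formats; extend the tuple for the formats your data uses.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")
# Loose sanity check: something@something.something, with no whitespace.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def clean_date(value):
    """Return the date as ISO 8601 text, or None if no assumed format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_email(value):
    """Return a trimmed, lower-cased address, or None if it looks invalid."""
    value = value.strip().lower()
    return value if EMAIL_PATTERN.match(value) else None


def clean_records(records):
    """Keep only records whose date and email could both be standardized."""
    cleaned = []
    for record in records:
        date = clean_date(record["signup_date"])
        email = clean_email(record["email"])
        if date is None or email is None:
            continue  # drop the nonconforming record
        cleaned.append({"signup_date": date, "email": email})
    return cleaned


# Made-up sample data with inconsistent formats and two bad records.
raw = [
    {"signup_date": "03/15/2016", "email": " Ana@Example.com "},
    {"signup_date": "15.03.2016", "email": "bo@example.org"},
    {"signup_date": "sometime in spring", "email": "cy@example.org"},
    {"signup_date": "2016-03-15", "email": "not-an-email"},
]

for row in clean_records(raw):
    print(row)
```

Silently dropping records is the simplest policy but not always the right one. In practice you may prefer to collect the rejected rows in a separate list so they can be reviewed and repaired.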

Publisher Resources

ISBN: 9787115459190