Python数据处理
by Jacqueline Kazil, Katharine Jarmul
July 2017
Intermediate to advanced
398 pages
11h 54m
Chinese
Posts & Telecom Press
Content preview from Python数据处理
Data Cleanup: Investigation, Matching, and Formatting
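We begin the data cleanup process by taking a close look at the file and the data it contains. The first step in data cleanup is usually a simple visual inspection. Let's look closely at the file and see what we find!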
7.2.1 Finding the Data That Needs Cleaning
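The first step in data cleanup is to look over the data fields and search carefully for inconsistencies. If you make the data look cleaner at the very start of cleanup, you will find it easier to spot the initial problems that need to be resolved during data normalization.
Let's take a look at the mn.csv file. It contains raw data and uses acronyms as headers, and the meaning of these acronyms is probably easy to translate. Here are the column headers of mn.csv: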
"","HH1","HH2","LN","MWM1","MWM2", ...
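Each of these represents a question or a piece of data in the survey, and we would like a more readable version. Searching on Google, we found the meanings of these headers on the World Bank site that shares the MICS data (http://microdata.worldbank.org/index.php/catalog/1794/datafile/F5).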
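To look at the header row of mn.csv yourself, a minimal sketch using Python's standard `csv` module might look like the following. The helper name `read_headers` is hypothetical, and the in-memory sample only contains the six fields quoted above, since the rest of the header row is elided in the text.

```python
import csv
import io


def read_headers(csv_file):
    """Return the first row of an open CSV file object as a list of strings."""
    reader = csv.reader(csv_file)
    return next(reader)


# Stand-in for the first line of mn.csv, limited to the fields quoted in the text.
sample = io.StringIO('"","HH1","HH2","LN","MWM1","MWM2"\n')
headers = read_headers(sample)
print(headers)

# With the real file (the path is an assumption), the call would be similar:
# with open("mn.csv", newline="") as f:
#     print(read_headers(f))
```

Note that the first header is an empty string, which is already a small inconsistency worth recording during the visual inspection.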
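Taking the time to check whether the World Bank site has a list of abbreviations can help you do a better job of data cleanup. You can also call the organization and ask whether it has an easy-to-use abbreviation list.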
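Using some of the web scraping techniques you will learn in Chapter 11, we were able to get a CSV file containing these headers and their English descriptions, along with the questions the World Bank asked when collecting the MICS data. We have placed the new headers produced by the web scraper in the book's repository (mn-headers.csv). We want to match these headers one-to-one with the survey data, so that we have readable questions and answers. There are a few ways to do this. ...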
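One possible way to pair the readable headers with the survey data is sketched below. It is not necessarily one of the approaches the book goes on to describe. It assumes that mn-headers.csv holds the abbreviation in its first column and the description in its second, which the preview does not confirm. The function names and the sample rows are made up for illustration, and the descriptions are placeholders rather than the real MICS meanings.

```python
import csv
import io


def load_header_map(headers_file):
    """Map each abbreviation to its description.

    Assumes column 0 is the abbreviation and column 1 is the description.
    """
    return {row[0]: row[1] for row in csv.reader(headers_file) if len(row) >= 2}


def rows_with_readable_headers(data_file, header_map):
    """Return data rows as dicts keyed by readable headers.

    Abbreviations missing from header_map are kept unchanged.
    """
    reader = csv.reader(data_file)
    short_headers = next(reader)
    readable = [header_map.get(h, h) for h in short_headers]
    return [dict(zip(readable, row)) for row in reader]


# Hypothetical stand-ins for mn-headers.csv and mn.csv.
headers_sample = io.StringIO(
    "HH1,placeholder description for HH1\n"
    "HH2,placeholder description for HH2\n"
)
data_sample = io.StringIO('"","HH1","HH2"\n"1","10","20"\n')

header_map = load_header_map(headers_sample)
rows = rows_with_readable_headers(data_sample, header_map)
print(rows[0])
```

Falling back to the original abbreviation when no match exists keeps unmatched columns visible, so they can be investigated rather than silently dropped. If two abbreviations shared the same description, the dict keys would collide, so that case would need separate handling.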

Publisher Resources

ISBN: 9787115459190