Python数据处理

by Jacqueline Kazil, Katharine Jarmul
July 2017
Intermediate to advanced
398 pages
11h 54m
Chinese
Posts & Telecom Press
Data Acquisition and Storage
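Once you have validated and fact-checked your first dataset, writing scripts to check the validity of later data becomes much easier. You can even use the techniques you learn in this book (particularly those in Chapter 14) to create scripts that update your data automatically.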
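As a rough illustration of what such a validation script might look like, here is a minimal sketch using only the standard library. The column names (`country`, `year`, `value`), the sample rows, and the two rules (required fields must be non-blank, numeric fields must parse as numbers) are assumptions made up for this example, not rules from the book; a real script would encode whatever checks you performed by hand on the first dataset.

```python
import csv
import io


def validate_rows(rows, required_fields, numeric_fields):
    """Return a list of (row_number, message) tuples for rows that fail a check."""
    problems = []
    for row_number, row in enumerate(rows, start=1):
        # Required fields must be present and non-blank.
        for field in required_fields:
            if not (row.get(field) or "").strip():
                problems.append((row_number, f"missing value for {field!r}"))
        # Numeric fields, when filled in, must parse as numbers.
        for field in numeric_fields:
            value = (row.get(field) or "").strip()
            if value:
                try:
                    float(value)
                except ValueError:
                    problems.append(
                        (row_number, f"non-numeric value {value!r} in {field!r}")
                    )
    return problems


# Made-up sample data standing in for a downloaded file (hypothetical columns).
SAMPLE_CSV = """country,year,value
Kenya,2014,12.5
,2014,8.1
Peru,2014,n/a
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
problems = validate_rows(
    rows, required_fields=["country", "year"], numeric_fields=["value"]
)
for row_number, message in problems:
    print(f"row {row_number}: {message}")
```

Returning the problems as data rather than printing them inside the function makes it easier to reuse the same checks when a later update of the dataset arrives.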
6.3 Data Readability, Cleanliness, and Longevity
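If your dataset looks very difficult to read, there is still a way forward: using what you will learn in Chapter 7, you can clean the data with code. Fortunately, data that was created by a computer can most likely be read by a computer. The harder problem is getting data from "real life" into a computer. As we saw in Chapter 5, PDFs and uncommon data file types are hard to work with, but not impossible.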
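We can use Python to help us read hard-to-read data, but data that is hard to read may point to a poor source. If it is a large, computer-generated dataset, that is one thing: database dumps are never pretty. But if your data is hard to read and was produced by people, the problem may lie in the data's cleanliness and validity.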
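Another question you face is whether the data has already been cleaned. You can find out by asking in detail how the data is collected, reported, and updated. You should be able to determine the following.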
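How clean is the data?
Has anyone reported statistical error rates, or corrected erroneous or misreported entries?
Will further updates be published, and will those updates be sent to you?
What methods were used to collect the data, and how were those methods verified?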
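Most of these questions have to be put to the data provider, but the first one can be partly estimated from the data itself. The sketch below is one possible way to do that, assuming the data has already been loaded as a list of dictionaries; the field names (`site`, `reading`) and the records are invented for illustration, and blank-value rates and exact duplicate rows are only two of many possible cleanliness signals.

```python
from collections import Counter


def profile_cleanliness(rows):
    """Summarize blank-value rates per column and count exact duplicate rows."""
    total = len(rows)
    blanks = Counter()
    for row in rows:
        for field, value in row.items():
            # Treat None and whitespace-only strings as blank.
            if value is None or not str(value).strip():
                blanks[field] += 1
    fields = rows[0].keys() if rows else []
    blank_rate = {field: blanks[field] / total for field in fields}
    # Rows with identical field/value pairs count as duplicates of each other.
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    duplicate_rows = sum(count - 1 for count in seen.values())
    return {"rows": total, "blank_rate": blank_rate, "duplicate_rows": duplicate_rows}


# Hypothetical records; in practice these would come from your own dataset.
records = [
    {"site": "A", "reading": "4.2"},
    {"site": "A", "reading": "4.2"},
    {"site": "B", "reading": ""},
    {"site": "", "reading": "3.9"},
]
report = profile_cleanliness(records)
print(report)
```

A high blank rate or many duplicates does not prove the data is bad, but it gives you concrete numbers to raise with whoever maintains the source.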
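If your data source uses standardized, rigorous research and collection methods, you can probably reuse your cleaning and reporting scripts with almost no changes for years to come. Those systems usually do not change on a regular basis (change is both expensive and time-consuming). Once you have written your cleaning scripts ...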
ISBN: 9787115459190