Python机器学习手册:从数据预处理到深度学习 (Python Machine Learning Handbook: From Data Preprocessing to Deep Learning)

by Chris Albon
July 2019
Intermediate to advanced
365 pages
8h 13m
Chinese
Publishing House of Electronics Industry
Content preview from Python机器学习手册:从数据预处理到深度学习
     Price  Bathrooms  Square_Feet  Outlier  Log_Of_Square_Feet
0   534433        2.0         1500        0            7.313220
1   392333        3.5         2500        0            7.824046
2   293222        2.0         1500        0            7.313220
3  4322032      116.0        48000        1           10.778956
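The code that produced this table is not part of the preview. As a minimal sketch, a table like it could be built with pandas and NumPy as follows. The threshold of 20 bathrooms used to flag outliers is an assumption made for illustration, as is the name `BATHROOM_THRESHOLD`; only the data values and column names come from the table.

```python
import numpy as np
import pandas as pd

# Data copied from the table above.
houses = pd.DataFrame({
    "Price": [534433, 392333, 293222, 4322032],
    "Bathrooms": [2.0, 3.5, 2.0, 116.0],
    "Square_Feet": [1500, 2500, 1500, 48000],
})

# Assumption: any house with 20 or more bathrooms is flagged as an outlier.
# The threshold is illustrative and does not come from the text.
BATHROOM_THRESHOLD = 20
houses["Outlier"] = np.where(houses["Bathrooms"] < BATHROOM_THRESHOLD, 0, 1)

# Transform the feature with a natural log to dampen the extreme value.
houses["Log_Of_Square_Feet"] = np.log(houses["Square_Feet"])

print(houses)
```

Marking the row keeps the observation available to the model, while the log transform shrinks the gap between 48000 and the other values without discarding anything.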
Discussion
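As with identifying outliers, there is no hard-and-fast rule for handling them. How you handle outliers should rest on two considerations. First, work out what makes them outliers. If you believe they are erroneous observations, for example values from a broken sensor or values that were recorded incorrectly, then drop them or replace them with NaN, because those values cannot be trusted. However, if you believe the outliers are genuine extreme values (for example, a mansion with 200 bedrooms), then marking them as outliers or transforming their values is the more reasonable approach.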
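For the case where the outliers are judged to be errors, the two options of dropping the observations or replacing the values with NaN might look like the sketch below. It reuses the values from the table earlier in this section. The rule that 20 or more bathrooms means "recorded incorrectly" is a hypothetical one chosen only to make the example concrete.

```python
import numpy as np
import pandas as pd

# Data copied from the table earlier in this section.
houses = pd.DataFrame({
    "Price": [534433, 392333, 293222, 4322032],
    "Bathrooms": [2.0, 3.5, 2.0, 116.0],
    "Square_Feet": [1500, 2500, 1500, 48000],
})

# Hypothetical rule for "this value is an error": 20 or more bathrooms.
is_error = houses["Bathrooms"] >= 20

# Option 1: drop the suspect observations entirely.
dropped = houses[~is_error]

# Option 2: keep the rows, but replace the untrusted value with NaN.
replaced = houses.copy()
replaced["Bathrooms"] = replaced["Bathrooms"].mask(is_error, np.nan)

print(dropped)
print(replaced)
```

Dropping loses the other columns of the row as well, whereas replacing with NaN keeps them but leaves a missing value that has to be handled later.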
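Second, how you handle outliers should depend on your machine learning goal. For example, if you want to predict house prices from the features of the houses, it is reasonable to assume that the price of a mansion with 100 bedrooms is driven by different features than the price of an ordinary family home. Likewise, if you are training a model on a portion of the data from an online home-loan web application, you can assume that its potential users do not include billionaires looking to buy a mansion with hundreds of bedrooms.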
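So, what should you do with outliers? First, think about why they are outliers, and then be clear about the end goal for the data. Most importantly, remember that deciding not to handle outliers is itself a decision with potential consequences.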
In addition, if the data contains outliers, standardization is not a very suitable way to rescale it, because the mean and ...
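The sentence above is cut off in the preview. As an illustration only, the sketch below shows how strongly a single outlier shifts the mean and the standard deviation, the two quantities that standardization relies on. It uses the Square_Feet values from the table earlier in this section; the helper `standardize` is a hypothetical name introduced for this example.

```python
import numpy as np

# Square_Feet values from the table; the last one is the outlier.
square_feet = np.array([1500.0, 2500.0, 1500.0, 48000.0])
without_outlier = square_feet[:3]


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    return (values - values.mean()) / values.std()


print("mean with / without outlier:",
      square_feet.mean(), without_outlier.mean())
print("std  with / without outlier:",
      square_feet.std(), without_outlier.std())
print("standardized, outlier included:", standardize(square_feet))
print("standardized, outlier excluded:", standardize(without_outlier))
```

With the outlier included, the mean rises from roughly 1,833 to 13,375 and the standard deviation from roughly 471 to roughly 19,995, so the three ordinary houses end up with nearly identical standardized values.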

Publisher Resources

ISBN: 9787121369629