Python机器学习基础教程

by Andreas C. Müller, Sarah Guido
January 2018
Intermediate to advanced
301 pages
8h 54m
Chinese
Posts & Telecom Press
Representing Data and Engineering Features
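... is sometimes a good idea as well. Trying to predict counts (say, the number of orders) is a fairly common task, and applying the log(y + 1) transformation often helps.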
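As a minimal sketch of the log(y + 1) idea, the snippet below fits an ordinary least-squares model to log-transformed counts and maps the predictions back to the original scale. The synthetic Poisson data, the weights in `w_true`, and the use of `np.linalg.lstsq` are assumptions made for illustration; they do not come from the book.

```python
import numpy as np

# Synthetic count data (an assumption for illustration): Poisson counts
# whose rate depends exponentially on three features.
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))
w_true = np.array([1.0, 0.5, -0.5])
y = rng.poisson(np.exp(X @ w_true))

# log(y + 1) transform; log1p is well defined for zero counts.
y_log = np.log1p(y)

# Ordinary least squares on the transformed target (with an intercept column).
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y_log, rcond=None)

# Predict on the log scale, then invert with exp(p) - 1.
pred_log = A @ coef
pred_counts = np.expm1(pred_log)

# The transform is exactly invertible on the observed counts.
roundtrip_ok = np.allclose(np.expm1(y_log), y)
```

Note that back-transformed predictions can fall slightly below zero when the log-scale prediction is negative, so clipping at zero may be appropriate when the output must be a valid count.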
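As the preceding examples show, binning, polynomials, and interaction terms can all have a large effect on how a model performs on a given dataset. This is especially true for less complex models such as linear models and naive Bayes models. Tree-based models, by contrast, can usually discover important interactions on their own and in most cases do not need the data to be transformed explicitly. Other models, such as SVMs, nearest neighbors, and neural networks, may sometimes benefit from binning, interactions, or polynomials, but the effect is usually less pronounced than it is for linear models.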
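The contrast between linear and tree-based models can be sketched on a toy problem where the target is a pure interaction of two features. The synthetic data and the particular estimators chosen here (`LinearRegression`, `PolynomialFeatures`, `RandomForestRegressor`) are assumptions for illustration, not the book's own example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy data (an assumption): the target is the product of the two features.
rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(1000, 2))
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.01, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A linear model on the raw features cannot represent x0 * x1.
linear_raw = LinearRegression().fit(X_train, y_train)

# Adding the interaction term explicitly makes the problem linear.
linear_inter = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearRegression(),
).fit(X_train, y_train)

# A tree ensemble can approximate the interaction from the raw features.
forest_raw = RandomForestRegressor(n_estimators=100, random_state=0)
forest_raw.fit(X_train, y_train)

# R^2 on the held-out test set for each model.
scores = {
    "linear_raw": linear_raw.score(X_test, y_test),
    "linear_inter": linear_inter.score(X_test, y_test),
    "forest_raw": forest_raw.score(X_test, y_test),
}
```

In this setup the linear model gains the most from the explicit interaction term, while the forest already does reasonably well without it.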
4.5 Automatic Feature Selection
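With so many ways to create new features, you might be tempted to increase the dimensionality of the data far beyond the number of original features. However, adding more features makes every model more complex and so increases the chance of overfitting. When adding new features, or when working with high-dimensional datasets in general, it is a good idea to reduce the features to only the most useful ones and discard the rest. This yields simpler models that generalize better. But how can you tell how useful each feature is? There are three basic strategies: univariate statistics, model-based selection, and iterative selection. We will discuss all three in detail. All of these are supervised methods, meaning that they need the target to fit the model. This means we need to split the data into a training set and a test set, and fit the feature selection only on the training set.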
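The train-only discipline can be sketched as follows for the model-based and iterative strategies. The scikit-learn classes used here (`SelectFromModel`, `RFE`), the breast cancer dataset, the added noise features, and all parameter values are assumptions chosen for illustration; this excerpt does not name a specific API.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.model_selection import train_test_split

# 30 real features plus 50 pure-noise features (illustrative setup).
cancer = load_breast_cancer()
rng = np.random.RandomState(42)
noise = rng.normal(size=(len(cancer.data), 50))
X = np.hstack([cancer.data, noise])

X_train, X_test, y_train, y_test = train_test_split(
    X, cancer.target, random_state=0, test_size=0.5)

# Model-based selection: keep features whose importance is at least the median.
model_based = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="median")
model_based.fit(X_train, y_train)            # fitted on the training set only
X_train_mb = model_based.transform(X_train)
X_test_mb = model_based.transform(X_test)    # test set is only transformed

# Iterative selection: repeatedly drop the least important features.
iterative = RFE(
    RandomForestClassifier(n_estimators=20, random_state=42),
    n_features_to_select=40, step=5)
iterative.fit(X_train, y_train)              # again, training set only
X_test_rfe = iterative.transform(X_test)
```

The selector never sees the test targets: `fit` is called on the training split, and the test split only passes through `transform`.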
4.5.1 Univariate Statistics
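In univariate statistics, we compute whether there is a statistically significant relationship between each feature and the target, and then select the features with the highest confidence ...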
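One possible way to carry out univariate selection is scikit-learn's `SelectPercentile` with the ANOVA F-test (`f_classif`). The choice of test, the 50% cutoff, the dataset, and the added noise features are all assumptions for this sketch, since the excerpt breaks off before naming a concrete method.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import train_test_split

# 30 real features followed by 50 pure-noise features (illustrative setup).
cancer = load_breast_cancer()
rng = np.random.RandomState(42)
noise = rng.normal(size=(len(cancer.data), 50))
X_w_noise = np.hstack([cancer.data, noise])

X_train, X_test, y_train, y_test = train_test_split(
    X_w_noise, cancer.target, random_state=0, test_size=0.5)

# Score each feature independently against the target and keep the top 50%.
select = SelectPercentile(score_func=f_classif, percentile=50)
select.fit(X_train, y_train)                 # fitted on the training set only

X_train_selected = select.transform(X_train)
X_test_selected = select.transform(X_test)

# Boolean mask over the 80 input features: True means the feature was kept.
mask = select.get_support()
```

Inspecting `mask` shows which columns survived; because each feature is scored on its own, a feature that is informative only in combination with another one would be discarded by this approach.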

ISBN: 9787115475619