精通特征工程
by Alice Zheng, Amanda Casari
April 2019
Intermediate to advanced
172 pages
4h 39m
Chinese
Posts & Telecom Press
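Plain one-hot encoding
  Space requirement: O(n) using the sparse vector format, where n is the number of data points
  Computation requirement: O(nk) under a linear model, where k is the number of categories
  Pros:
    - Easy to implement
    - Potentially the most accurate
    - Feasible for online learning
  Cons:
    - Computationally inefficient
    - Does not adapt to a growing set of categories
    - Only feasible for linear models
    - Requires large-scale distributed optimization for large datasets

Feature hashing
  Space requirement: O(n) using the sparse vector format, where n is the number of data points
  Computation requirement: O(nm) under a linear model or kernel method, where m is the number of hash bins
  Pros:
    - Easy to implement
    - Makes model training cheaper
    - Easily adapts to new categories
    - Easily handles rare categories
    - Feasible for online learning
  Cons:
    - Only suitable for linear models or kernel methods
    - Hashed features are not interpretable
    - Accuracy is hard to guarantee

Bin counting
  Space requirement: O(n + k), for a small, dense vector per data point plus the count statistics kept for each category
  Computation requirement: O(n) under a linear model; also usable with nonlinear models such as trees
  Pros:
    - Smallest computational burden at training time
    - Usable with tree-based models
    - Relatively easy to adapt to new categories
    - Handles rare categories with back-off or a count-min sketch
    - Interpretable
  Cons:
    - Requires historical data
    - Requires delayed updates, so not fully suitable for online learning
    - High potential for data leakage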
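
To make the bin-counting entry more concrete, here is a minimal sketch in plain Python. It is not the book's code: the function names `fit_bin_counts` and `encode`, the `min_count` threshold, and the toy data are all hypothetical, and the single output feature (the historical positive rate) is only one of several statistics that could be used. Categories seen fewer than `min_count` times are pooled into a shared back-off bin, and unseen categories fall back to the same bin.

```python
from collections import defaultdict

def fit_bin_counts(categories, labels, min_count=2):
    """Collect per-category (positive, negative) counts from historical data.

    Assumption for this sketch: categories seen fewer than min_count times
    are pooled into one shared back-off bin instead of getting their own row.
    """
    raw = defaultdict(lambda: [0, 0])
    for cat, y in zip(categories, labels):
        raw[cat][0 if y == 1 else 1] += 1

    table = {}
    backoff = [0, 0]
    for cat, (pos, neg) in raw.items():
        if pos + neg >= min_count:
            table[cat] = (pos, neg)
        else:
            backoff[0] += pos
            backoff[1] += neg
    return table, tuple(backoff)

def encode(cat, table, backoff):
    """Map a category to one dense feature: its historical positive rate.

    Rare and unseen categories use the statistics of the back-off bin.
    """
    pos, neg = table.get(cat, backoff)
    total = pos + neg
    return pos / total if total else 0.0

# Hypothetical historical data, kept separate from the rows encoded later.
history_cats = ["a", "a", "a", "b", "b", "c", "d"]
history_labels = [1, 0, 1, 0, 0, 1, 0]

table, backoff = fit_bin_counts(history_cats, history_labels)
print(encode("a", table, backoff))    # frequent category: its own rate
print(encode("c", table, backoff))    # rare category: back-off bin
print(encode("new", table, backoff))  # unseen category: back-off bin
```

The statistics table has one row per frequent category, which is the k term in O(n + k), while each data point becomes a single dense number. Computing the counts on a historical slice that does not overlap the rows being encoded is one way to lower the leakage risk listed among the cons.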
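As you can see, none of the methods is perfect. Which one to use depends on the model. Linear models are cheap to train and can therefore handle uncompressed feature representations such as one-hot encoding. Tree-based models, on the other hand, must repeatedly search over all features to find the right split, so they are limited to small feature representations such as bin counting. Feature hashing sits between these two extremes, but reports on the resulting accuracy are mixed. ...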
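
The difference in representation size between one-hot encoding and feature hashing can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the book's implementation: `build_vocab`, `one_hot`, and `hashed` are hypothetical helper names, the category values are made up, and `zlib.crc32` stands in for whatever hash function a real system would use. Each encoded point is stored sparsely as an `{index: 1}` dictionary.

```python
import zlib

def build_vocab(categories):
    """Assign one column index to every distinct category seen in training."""
    return {c: i for i, c in enumerate(sorted(set(categories)))}

def one_hot(cat, vocab):
    """Sparse one-hot vector; the full width is len(vocab), i.e. k."""
    return {vocab[cat]: 1}

def hashed(cat, m):
    """Sparse hashed vector; the width is fixed at m, whatever k is.

    zlib.crc32 is used here only because it is deterministic across runs;
    it is an assumption of this sketch, not a recommendation.
    """
    return {zlib.crc32(cat.encode("utf-8")) % m: 1}

train = ["u1", "u2", "u3", "u4", "u5", "u6"]
vocab = build_vocab(train)
k = len(vocab)  # one-hot width grows with the number of categories
m = 4           # hashed width is chosen up front

print("one-hot width:", k)
print("hashed width:", m)

# A category that was not in the training data still lands in a hash bin ...
print("hashed 'u7':", hashed("u7", m))

# ... but has no column in the one-hot vocabulary.
try:
    one_hot("u7", vocab)
except KeyError:
    print("one-hot cannot encode 'u7' without rebuilding the vocabulary")
```

With m smaller than k, several categories necessarily share a bin, which is why the hashed features are hard to interpret and their accuracy is hard to guarantee.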
Publisher Resources

ISBN: 9787115509680