精通特征工程 (Feature Engineering for Machine Learning)

by Alice Zheng, Amanda Casari
April 2019
Intermediate to advanced
172 pages
4h 39m
Chinese
Posts & Telecom Press
5.1.4 Pros and Cons of Categorical Variable Encodings
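One-hot encoding, dummy coding, and effect coding are very similar to one another, and each has its own pros and cons. One-hot encoding is redundant, which allows multiple valid models for the same problem; this non-uniqueness is sometimes hard to interpret. Its advantage is that each feature clearly corresponds to one category. Moreover, missing data can be encoded as the all-zeros vector, and the model output is then the overall mean of the target variable.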
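The all-zeros treatment of missing data can be seen in a minimal sketch using `pandas.get_dummies`. The `city` column and its values are hypothetical toy data, not taken from the book.

```python
import pandas as pd

# Hypothetical toy data: one categorical column with a missing value.
df = pd.DataFrame({"city": ["SF", "NYC", "Seattle", None]})

# One-hot encoding: one indicator column per observed category.
# With the default dummy_na=False, a missing value gets no indicator,
# so its row is the all-zeros vector.
one_hot = pd.get_dummies(df["city"], dtype=int)

print(one_hot)
```

Every row with a known category sums to 1, which is the redundancy the text refers to; the row for the missing value sums to 0.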
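Dummy coding and effect coding are not redundant, so they give rise to unique and interpretable models. The downside of dummy coding is that it cannot easily handle missing data, since the all-zeros vector is already mapped to the reference category. It also expresses the effect of each category relative to the reference category, which can look unintuitive.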
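The collision between the reference category and missing data can be illustrated with a short sketch, again on hypothetical toy data. Passing `dummy_na=True` is shown as one possible workaround; it is an assumption of this example, not something the text prescribes.

```python
import pandas as pd

# Hypothetical toy data; "NYC" sorts first and becomes the reference category.
df = pd.DataFrame({"city": ["NYC", "SF", "Seattle", None]})

# Dummy coding: drop the first category's indicator column.
dummy = pd.get_dummies(df["city"], drop_first=True, dtype=int)

# The reference row and the missing row are indistinguishable.
reference_row = dummy.iloc[0].tolist()
missing_row = dummy.iloc[3].tolist()
print(reference_row == missing_row)

# One possible workaround: give missing values their own indicator column.
dummy_with_na = pd.get_dummies(
    df["city"], drop_first=True, dummy_na=True, dtype=int
)
print(dummy_with_na)
```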
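Effect coding avoids this problem by using a different code for the reference category, but the all -1 vector is dense, which makes it expensive to both compute and store. For this reason, popular machine learning packages such as Pandas and scikit-learn favor dummy coding or one-hot encoding over effect coding.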
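Since the paragraph notes that these packages favor dummy and one-hot encoding, the sketch below builds effect coding by hand: it starts from dummy coding and overwrites the reference rows with -1. This is one way to do it under the assumption that "NYC" is the reference category; the data is hypothetical.

```python
import pandas as pd

# Hypothetical toy data; "NYC" is treated as the reference category.
df = pd.DataFrame({"city": ["NYC", "SF", "Seattle", "NYC"]})

# Start from dummy coding (reference rows are all zeros) ...
dummy = pd.get_dummies(df["city"], drop_first=True, dtype=int)

# ... then overwrite every reference row with -1 in all columns.
effect = dummy.copy()
is_reference = df["city"] == "NYC"
effect.loc[is_reference, :] = -1

# Count non-zero entries to compare density.
dummy_nonzero = int((dummy != 0).sum().sum())
effect_nonzero = int((effect != 0).sum().sum())
print(effect)
print(dummy_nonzero, effect_nonzero)
```

With k categories, each reference row stores k - 1 non-zero entries under effect coding instead of none under dummy coding, which is why a sparse representation helps less.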
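All three encoding schemes break down when the number of categories becomes very large, so different strategies are needed to handle extremely large categorical variables.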
5.2 Dealing with Large Categorical Variables
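Automated data collection on the internet can generate large categorical variables. This is very common in applications such as targeted advertising and fraud detection.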
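In targeted advertising, the task is to match a user with a set of ads. The features include the user ID, the website domain of the ad, the search query, the current page, and all pairwise combinations of these features. (The query is a text string that can be broken up and turned into the usual text features. However, queries are generally short and often consist of phrases, so the best approach here is to keep them intact or to pass them through a hash function, in order to make ...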
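The two ideas in this paragraph, pairwise feature combinations and passing an intact query through a hash function, can be sketched as follows. The `record` fields, the `hash_bucket` helper, the choice of MD5, and the bucket count are all assumptions made for illustration; the text does not specify any of them.

```python
import hashlib
from itertools import combinations


def hash_bucket(value: str, n_buckets: int = 2**20) -> int:
    """Map a string to a stable integer bucket (hypothetical helper).

    MD5 is used only because it is deterministic across processes,
    unlike Python's built-in hash() for strings.
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


# Hypothetical ad-matching record with the four base features from the text.
record = {
    "user_id": "u123",
    "ad_domain": "example.com",
    "query": "cheap flights to tokyo",
    "current_page": "/travel",
}

# All pairwise combinations of the base features, each one a new category.
pairs = [
    f"{a}={record[a]}&{b}={record[b]}"
    for a, b in combinations(sorted(record), 2)
]

# Keep the query intact and hash it, rather than splitting it into words.
query_id = hash_bucket(record["query"])

print(len(pairs), query_id)
```

Four base features already yield six pairwise features, and every distinct value pair is its own category, which is how such variables grow so large.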
Publisher Resources

ISBN: 9787115509680