精通特征工程 (Chinese edition of Feature Engineering for Machine Learning)

by Alice Zheng, Amanda Casari
April 2019
Intermediate to advanced
172 pages
4h 39m
Chinese
Posts & Telecom Press
Content preview from 精通特征工程
Categorical Variables: Counting Data in the Age of Automation
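[…] used binary features of this kind in a Bayesian probit regression model, which can be trained online with simple updates. Meanwhile, other research groups have focused on feature compression. Researchers at Yahoo! swear by feature hashing (Weinberger et al., 2009), but McMahan et al. tried feature hashing on Google's advertising engine and saw no significant improvement. Still other researchers at Microsoft have put the idea of bin counting into practice (Bilenko, 2015).
As we will see, every one of these methods has its own strengths and weaknesses. We first describe the methods themselves and then discuss their trade-offs.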
5.2.1 Feature hashing
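A hash function is a deterministic function that maps a potentially unbounded integer to a finite integer range [1, m]. Because the input domain may be larger than the output range, several values may be mapped to the same output. This is called a collision. A uniform hash function ensures that roughly the same number of values are mapped into each of the m bins.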
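The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a production hash: the function name `hash_to_bin` and the modulo rule are assumptions chosen for simplicity, and the rule only spreads keys evenly when the keys themselves are evenly spread (as consecutive integers are).

```python
from collections import Counter

def hash_to_bin(key: int, m: int) -> int:
    """Map an integer key to a bin in [1, m] (illustrative modulo hash, an assumption)."""
    return key % m + 1

m = 5
keys = range(1000)  # 1000 distinct keys, only 5 possible outputs

# Deterministic: the same key always yields the same bin.
same_bin_twice = hash_to_bin(42, m) == hash_to_bin(42, m)

# Collision: two different keys share one output because inputs outnumber bins.
collision = hash_to_bin(3, m) == hash_to_bin(8, m)

# Uniform: every bin receives the same number of keys for this input.
counts = Counter(hash_to_bin(k, m) for k in keys)
print(sorted(counts.items()))  # [(1, 200), (2, 200), (3, 200), (4, 200), (5, 200)]
```

With 1000 keys and 5 bins, collisions are unavoidable; the sketch only makes visible how evenly they are distributed.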
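We can picture a hash function as a machine that takes in numbered balls (the keys) and routes each one to one of m bins. Balls carrying the same number are always routed to the same bin (see Figure 5-1). The hash function thus maintains the feature space while reducing storage and processing time during the training and evaluation cycles of machine learning.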
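One way to see the storage saving is to count how many keys land in each bin and keep only those m counts. The sketch below assumes that reading of the text; the helper `hashed_count_vector`, the modulo hash, and the sample IDs are all hypothetical and chosen only for illustration.

```python
def hashed_count_vector(keys, m):
    """Return a length-m list; entry i counts the keys routed to bin i + 1."""
    vector = [0] * m
    for key in keys:
        vector[key % m] += 1  # illustrative modulo hash, an assumption
    return vector

# Hypothetical IDs drawn from a very large space; a repeated ID hits the same bin.
user_ids = [7, 1_000_003, 7, 42, 9_999_999_999]
vec = hashed_count_vector(user_ids, m=8)
print(vec)  # [0, 0, 1, 1, 0, 0, 0, 3]
```

The vector has 8 entries no matter how large the ID space is. The price is that 7 and 9_999_999_999 collide in the last bin and can no longer be told apart.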
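[Figure 5-1: A hash function maps keys to bins. The keys mama, dada, bottle, flags, banana, doggy, and bubble pass through the hash function ("it's math, not magic") and come out as hash values in bins 00 through 26.]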
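We can construct a hash function for any object that can be represented numerically (that is, for any data that can be stored on a computer), including numbers, strings, complex structures, and so on. ...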
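As a rough sketch of hashing non-integer objects, the snippet below turns an object into bytes, digests the bytes, and reduces the result to a bin in [1, m]. The helper name `stable_bin`, the use of `repr()` for serialization, the choice of SHA-256, and the bin count of 27 are all assumptions made for illustration. A cryptographic digest is used instead of Python's built-in `hash()` because string hashes from `hash()` are randomized per process by default, so bins would differ between runs.

```python
import hashlib

def stable_bin(obj, m: int) -> int:
    """Route any object with a stable repr() to a bin in [1, m]."""
    data = repr(obj).encode("utf-8")        # serialize the object to bytes
    digest = hashlib.sha256(data).digest()  # used for bucketing here, not for security
    return int.from_bytes(digest, "big") % m + 1

m = 27  # hypothetical number of bins
keys = ["mama", "dada", "bottle", 42, 3.14, ("doggy", 2)]
bins = {repr(k): stable_bin(k, m) for k in keys}

for key, bin_index in bins.items():
    print(f"{key} -> bin {bin_index}")
```

Relying on `repr()` is only reasonable for objects whose representation is stable, such as numbers, strings, and tuples of them; objects whose default representation includes a memory address would need an explicit serialization.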
Publisher Resources

ISBN: 9787115509680