精通特征工程

by Alice Zheng, Amanda Casari
April 2019
Intermediate to advanced
172 pages
4h 39m
Chinese
Posts & Telecom Press
Content preview from 精通特征工程
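... the output range m of each function is far smaller than the number of categories k. To compute a statistic, every hash function is applied and the smallest of the resulting values is returned. Compared with a single hash function, using several hash functions lowers the probability of a collision. The scheme works because the number of hash functions multiplied by the hash table size m can be smaller than the number of categories k while the collision probability still stays very low.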
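Figure 5-4 illustrates the process. Each item i is mapped to one cell in every row of the count array. When the count c_t of item i_t is updated, the item is hashed with the functions h_1, ..., h_d, and c_t is added to each of the resulting cells.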
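[Figure 5-4: The count-min sketch. An update (i_t, c_t) is hashed by h_1 ... h_d, and c_t is added to one cell in each row of the count array.]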
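The update and query steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: the class name `CountMinSketch`, the ad identifiers, and the use of a row-salted SHA-256 digest as a stand-in for d independent hash functions are all assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """d rows of m counters; each row uses its own hash function."""

    def __init__(self, d, m):
        self.d = d
        self.m = m
        self.table = [[0] * m for _ in range(d)]

    def _cell(self, row, item):
        # Assumption: salting the item with the row index stands in for
        # d independent hash functions h_1 ... h_d.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).hexdigest()
        return int(digest, 16) % self.m

    def update(self, item, count=1):
        # Add the count c_t to one cell in every row.
        for row in range(self.d):
            self.table[row][self._cell(row, item)] += count

    def estimate(self, item):
        # With non-negative counts, a collision can only inflate a cell,
        # so the minimum over the rows is the tightest available estimate.
        return min(
            self.table[row][self._cell(row, item)] for row in range(self.d)
        )


# Hypothetical stream of (item, count) updates.
sketch = CountMinSketch(d=4, m=64)
for item, count in [("ad_17", 3), ("ad_42", 5), ("ad_17", 2)]:
    sketch.update(item, count)

print(sketch.estimate("ad_17"))  # never below the true count of 5
```

Because the estimate is a minimum over cells that can only be over-counted, it never falls below the true count; it exceeds the true count only when the item collides with other items in every row.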
2. Guarding against data leakage
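Because bin counting relies on historical data to generate the statistics it needs, it has to wait through a data collection period, which adds a slight delay to the learning pipeline. In addition, the counts must be updated whenever the data distribution changes. The faster the data changes, the more often the counts have to be recomputed. This matters most in applications such as targeted advertising, where user preferences and popular queries shift very quickly, and failing to adapt to the current data distribution can mean huge losses for the advertising platform.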
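One possible way to keep counts aligned with a shifting distribution is to count only the most recent events. The sketch below is an assumption for illustration, not a method given in the text: the class `WindowedCounts`, its `window` parameter, and the sample queries are hypothetical.

```python
from collections import Counter, deque


class WindowedCounts:
    """Per-category counts over the most recent `window` events."""

    def __init__(self, window):
        self.window = window
        self.events = deque()
        self.counts = Counter()

    def add(self, category):
        self.events.append(category)
        self.counts[category] += 1
        # Once the window is full, forget the oldest event.
        if len(self.events) > self.window:
            old = self.events.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]


# Hypothetical query stream whose popular term changes over time.
recent = WindowedCounts(window=3)
for query in ["shoes", "shoes", "phones", "phones", "phones"]:
    recent.add(query)

print(dict(recent.counts))
```

A shorter window follows changes more quickly but leaves fewer events behind each count, so the window size is a trade-off to tune per application.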
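One might ask: why not use the same dataset both to compute the statistics and to train the model? That idea is naive. The central problem is that the statistics include the target variable, which is exactly what the model is trying to predict. Using the output to compute the input features leads to a very serious problem known as data leakage. Put simply, data leakage gives the model information it should not have, and that information hands the model an unrealistic advantage that lets it make more accurate predictions. Leakage can arise in several ways, for example when test data leaks into the training data, or when data from the future leaks into the past. Whenever a model obtains information that it should not have access to when making real-time predictions in production ...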
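A toy example makes the problem concrete. The preview is cut off before any remedy is described, so the "safer" variant below (statistics taken only from earlier rows, with a global fallback for unseen categories) is an assumption for illustration; the function `click_rates` and all of the data are hypothetical.

```python
from collections import defaultdict


def click_rates(rows):
    """Bin-count statistic: per-category click rate from (category, clicked) rows."""
    clicks = defaultdict(int)
    views = defaultdict(int)
    for category, clicked in rows:
        views[category] += 1
        clicks[category] += clicked
    return {c: clicks[c] / views[c] for c in views}


# Hypothetical toy data: (ad_id, clicked).
history = [("ad_1", 0), ("ad_1", 1), ("ad_2", 0), ("ad_2", 0)]
training = [("ad_1", 1), ("ad_2", 0), ("ad_3", 1)]

# Leaky: the statistic is computed from the same rows the model trains on.
# ad_3 appears only once, so its feature value is exactly its own label.
leaky = click_rates(training)

# Safer (assumed remedy): the statistic comes only from earlier rows, and
# unseen categories fall back to the overall historical click rate.
past = click_rates(history)
fallback = sum(clicked for _, clicked in history) / len(history)
safe_features = [past.get(category, fallback) for category, _ in training]

print(leaky["ad_3"])   # 1.0, identical to the label of that row
print(safe_features)   # [0.5, 0.0, 0.25]
```

In the leaky version the feature for a rare category simply repeats its label, which a model can exploit during training but never in production, where the label is not yet known.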
Publisher Resources

ISBN: 9787115509680