精通特征工程

by Alice Zheng, Amanda Casari
April 2019
Intermediate to advanced
172 pages
4h 39m
Chinese
Posts & Telecom Press
Categorical Variables: Counting Eggs in the Age of Robotic Chickens
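[Figure 5-5: Using time windows can prevent data leakage during bin counting. Timeline labels: data used to compute bin-count statistics, data used for training, data used for testing.]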
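There is also a solution based on differential privacy. A statistic is approximately leakage-proof if its distribution stays roughly the same with or without any single data point. In practice, adding a small random noise drawn from the Laplace(0,1) distribution is enough to cover up any potential leakage from a single data point. This idea can be combined with leave-one-out counting to formulate statistics on the current data (Zhang, 2015).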
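
The combination described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method from Zhang (2015): the function name `leave_one_out_counts`, the toy data, and the choice to count positive labels per category are all hypothetical. The noise scale of 1.0 corresponds to the Laplace(0,1) distribution mentioned in the text.

```python
import numpy as np

def leave_one_out_counts(categories, labels, noise_scale=1.0, seed=0):
    """Per-row count of positive labels in the row's category, excluding
    the row itself, with and without Laplace(0, noise_scale) noise.

    Hypothetical helper written for illustration only.
    """
    categories = np.asarray(categories)
    labels = np.asarray(labels, dtype=float)
    rng = np.random.default_rng(seed)

    # Sum of positive labels per category, then mapped back onto each row.
    uniques, inverse = np.unique(categories, return_inverse=True)
    totals = np.bincount(inverse, weights=labels, minlength=len(uniques))

    # Leave-one-out: remove each row's own label from its category total.
    loo = totals[inverse] - labels

    # Small random noise to mask the contribution of any single data point.
    noise = rng.laplace(loc=0.0, scale=noise_scale, size=len(labels))
    return loo, loo + noise

# Toy data (assumed): category "a" has two positives, "b" has one.
cats = ["a", "a", "b", "a", "b"]
y = [1, 0, 1, 1, 0]
loo, noisy = leave_one_out_counts(cats, y)
```

Here `loo` holds the exact leave-one-out counts and `noisy` holds the values that would be fed to the model. With real data the noise scale would need to be chosen relative to the size of the counts.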
3. Unbounded counts
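If the statistics are updated continuously as more and more historical data arrives, the raw counts grow without bound. This is a problem for the model. A trained model only "knows" the input data within the range it has seen. A trained decision tree might say, "When x is greater than 3, predict 1." A trained linear model might say, "Multiply x by 0.7 and see if the result is greater than the global average." These may be the correct decisions when x lies between 0 and 5. But what happens beyond that range? No one knows.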
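
One way to make the "known range" explicit is to record the range of a feature at training time and flag inputs that fall outside it. The sketch below is only an illustration of that idea; the class `RangeGuard` and its method names are hypothetical, and the training values simply mirror the 0-to-5 example in the text.

```python
import numpy as np

class RangeGuard:
    """Remembers the range of a feature seen during training.

    Hypothetical helper for illustration; not part of any library.
    """

    def fit(self, x):
        x = np.asarray(x, dtype=float)
        # Store the smallest and largest training values.
        self.low_, self.high_ = float(x.min()), float(x.max())
        return self

    def out_of_range(self, x):
        x = np.asarray(x, dtype=float)
        # True where the model would be extrapolating beyond what it saw.
        return (x < self.low_) | (x > self.high_)

# The model was trained on x between 0 and 5.
guard = RangeGuard().fit([0, 1, 2, 3, 4, 5])

# A count that has grown to 40 lies outside the trained range.
flags = guard.out_of_range([2, 5, 40])
```

A check like this does not fix the problem, but it shows when growing counts have pushed the input outside the range on which the learned thresholds and coefficients were fitted.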
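As the input counts grow, the model has to be maintained (retrained) so that it stays matched to the scale of its inputs. If the counts accumulate slowly, the effective range does not change too fast and the model does not need frequent maintenance. But when the counts grow very quickly, the need for overly frequent maintenance becomes a real burden.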
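For this reason, it is usually better to use normalized counts, which are guaranteed to stay within a known interval. For example, the estimated click-through rate is bounded in the range [0, 1]. Another method is to apply a log transform, which acts as a bound in practice: the logarithm itself is unbounded, but the transformed value increases very slowly when the count is very large.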
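
The two options above can be compared on a few toy numbers. This is a minimal sketch: the click and non-click counts are made up for illustration, and `np.log1p` is used instead of a plain logarithm only so that a count of zero is handled.

```python
import numpy as np

# Hypothetical raw counts for five categories, spanning several orders
# of magnitude as history accumulates.
clicks = np.array([0, 3, 50, 5_000, 500_000], dtype=float)
non_clicks = np.array([10, 7, 950, 95_000, 9_500_000], dtype=float)

# Normalized count: the estimated click-through rate always lies in [0, 1].
# Assumes every category has at least one observation (no zero division).
ctr = clicks / (clicks + non_clicks)

# Log transform: log(1 + count) grows very slowly for large counts.
log_clicks = np.log1p(clicks)

for c, r, l in zip(clicks, ctr, log_clicks):
    print(f"count={c:>9.0f}  ctr={r:.2f}  log1p={l:.2f}")
```

In this toy data the raw count grows by a factor of 10,000 between the third and fifth categories while the click-through rate stays at 0.05, and the log-transformed value stays below 14.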
Neither method can guarantee that the input distribution stays the same (for example, last year's Barbie dolls are now out of fashion ...
Publisher Resources

ISBN: 9787115509680