精通特征工程
by Alice Zheng, Amanda Casari
April 2019
Intermediate to advanced
172 pages
4h 39m
Chinese
Posts & Telecom Press
Content preview from 精通特征工程
Chapter 2
Fancy Tricks with Simple Numbers
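Before turning to complex data types such as text and images, let us start with the simplest one: numeric data. Numeric data comes from many sources: the geolocation of a person or a place, the price of a transaction, measurements from a sensor, traffic counts, and so on. Numeric data is already in a form that mathematical models can consume easily, but that does not mean feature engineering is unnecessary. Good features should not only represent the salient aspects of the data but also conform to the assumptions of the model, so transforming the data is often necessary. Feature engineering techniques for numeric data are fundamental: they can be applied whenever raw data has been converted into numeric features.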
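The first sanity check for numeric data is its magnitude. Do we only need to know whether a value is positive or negative? Or do we only need to know its magnitude at a very coarse granularity? This check is especially important for automatically generated numbers such as counts (the number of daily visits to a website, the number of reviews a restaurant has received, and so on).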
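The two questions above can be made concrete with a small sketch. It is only an illustration, not a method prescribed by the text: the visit counts are made-up values, and mapping zero counts to -1 is an arbitrary convention chosen here because zero has no logarithm.

```python
import numpy as np

# Hypothetical daily visit counts; the values are made up for illustration.
visits = np.array([0, 3, 12, 150, 4200, 98000])

# Option 1: keep only presence or absence (is the count positive?).
has_visits = (visits > 0).astype(int)

# Option 2: keep only the order of magnitude, floor(log10(count)).
# Zero counts have no logarithm, so they are mapped to -1 by convention here.
magnitude = np.full(visits.shape, -1)
positive = visits > 0
magnitude[positive] = np.floor(np.log10(visits[positive])).astype(int)

print(has_visits.tolist())  # [0, 1, 1, 1, 1, 1]
print(magnitude.tolist())   # [-1, 0, 1, 2, 3, 4]
```

Either derived column discards most of the raw value on purpose; which one is appropriate depends on whether the model needs more than the sign or the rough size of the count.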
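Next, consider the scale of the features. What are the largest and smallest values? Do they span several orders of magnitude? A model that is a smooth function of its input features is sensitive to the scale of the input. For example, 3x + 1 is a simple linear function of the input x, and the scale of its output depends directly on the scale of the input. The same holds for k-means clustering, nearest neighbor methods, radial basis function (RBF) kernels, and anything else that uses the Euclidean distance. For these models and model components, it is usually a good idea to normalize the features so that the output stays within an expected range.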
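The following sketch illustrates the scale sensitivity described above. The data matrix is invented for the example, and z-score standardization is used as one common normalization choice; the text does not say which normalization it has in mind.

```python
import numpy as np

# Made-up data: column 0 is on the order of 1, column 1 on the order of 10^4.
X = np.array([
    [1.0, 20000.0],
    [2.0, 21000.0],
    [9.0, 20500.0],
])

def linear(x):
    """The linear function 3x + 1 from the text."""
    return 3 * x + 1

# The output scale of a linear function follows the input scale.
print(linear(1.0), linear(1000.0))  # 4.0 3001.0

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# On raw data, the distance is dominated by the large-scale column,
# so row 2 looks closer to row 0 than row 1 does.
raw_01 = euclidean(X[0], X[1])
raw_02 = euclidean(X[0], X[2])
print(raw_02 < raw_01)  # True

# Standardize each column to zero mean and unit variance (z-score).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# After standardization both columns contribute comparably,
# and the nearest neighbor of row 0 changes to row 1.
std_01 = euclidean(X_std[0], X_std[1])
std_02 = euclidean(X_std[0], X_std[2])
print(std_01 < std_02)  # True
```

In this toy data the ranking of neighbors flips once the columns are put on the same scale, which is the practical reason distance-based methods usually need normalized inputs.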
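Logical functions, in contrast, are not sensitive to the scale of the input features. Whatever the input, the output of such a function is always binary. For example, the logical AND takes two variables and outputs 1 if and only if both inputs are true. Another example of a logical function is the step function (for example, is the input x greater than 5?). Decision tree models use ...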
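As a small illustration of the logical functions named above, the sketch below implements the step function with the threshold 5 from the text and a logical AND. The helper names and the sample inputs are assumptions made for this example.

```python
import numpy as np

def step(x, threshold=5):
    """Step function from the text: 1 if x > threshold, else 0."""
    return (np.asarray(x) > threshold).astype(int)

def logical_and(a, b):
    """Logical AND: 1 if and only if both inputs are true."""
    return int(bool(a) and bool(b))

small = np.array([1, 4, 6, 9])
large = small * 1_000_000  # the same values on a much larger scale

# The output is always 0 or 1, however large the inputs become.
print(step(small).tolist())  # [0, 0, 1, 1]
print(step(large).tolist())  # [1, 1, 1, 1]

# A linear function of the same inputs, by contrast, grows with them.
print((3 * large + 1).max())  # 27000001

print(logical_and(True, True), logical_and(True, False))  # 1 0
```

Rescaling the input can change which side of the threshold a value falls on, but it cannot push the output outside the set {0, 1}.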

Publisher Resources

ISBN: 9787115509680