精通特征工程

by Alice Zheng, Amanda Casari
April 2019
Intermediate to advanced
172 pages
4h 39m
Chinese
Posts & Telecom Press
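Proper tuning improves accuracy for every feature set, and after regularized logistic regression all three feature sets reach similar classification accuracy. The tf-idf model scores slightly higher, but the difference does not appear to be statistically significant. These results are puzzling. If feature scaling works no better than a plain bag-of-words representation, what is the point of it? And if tf-idf does not accomplish anything, why go to all this trouble? The next section tries to answer these questions.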
4.3 Deep Dive: What Is Happening?
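To understand the reasons behind these results, we have to look at how the model uses the features. For a linear model such as logistic regression, this happens through an intermediate object called the data matrix.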
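As a rough illustration of how a linear model consumes the data matrix, the sketch below computes logistic regression probabilities row by row: each row of the matrix is combined with a single shared weight vector, then passed through the sigmoid function. The function name `predict_proba`, the tiny matrix `X`, and the weights `w` are hypothetical values chosen for illustration, not taken from the book.

```python
import math

def predict_proba(data_matrix, weights, bias=0.0):
    """Return one sigmoid(w . x + b) probability per row of the data matrix."""
    probabilities = []
    for row in data_matrix:
        # Linear score: dot product of the row with the shared weight vector.
        score = sum(x * w for x, w in zip(row, weights)) + bias
        probabilities.append(1.0 / (1.0 + math.exp(-score)))
    return probabilities

# Hypothetical 2-document, 3-word data matrix and hypothetical weights.
X = [[1, 0, 2],
     [0, 1, 0]]
w = [0.5, -1.0, 0.25]

probs = predict_proba(X, w)
print(probs)
```

Because every weight is tied to one column, anything that rescales a column changes how strongly that word can influence the score.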
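The data matrix contains the data points, each represented as a fixed-length flat vector. With bag-of-words vectors, the data matrix is also called the document-term matrix. Figure 3-1 shows a bag-of-words vector in vector form, and Figure 4-1 illustrates four bag-of-words vectors in feature space. To build a document-term matrix, take the document vectors, lay them out flat, and stack them on top of one another. The columns of this matrix represent all the words that can appear in the vocabulary (see Figure 4-5). Because most documents contain only a small fraction of all possible words, most entries of the matrix are 0, so it is a sparse matrix.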
Document                         it  is  puppy  cat  pen  a  this
it is a puppy                     1   1      1    0    0  1     0
it is a kitten                    1   1      0    0    0  1     0
it is a cat                       1   1      0    1    0  1     0
that is a dog and this is a pen   0   2      0    0    1  2     1
it is a matrix                    1   1      0    0    0  1     0

Figure 4-5. An example document-term matrix with 5 documents and 7 words
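The stacking procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration that uses the five example documents and the seven column words of Figure 4-5; the helper name `document_term_matrix` and the whitespace tokenization are assumptions made for this sketch, not the book's code.

```python
from collections import Counter

# The five example documents from Figure 4-5.
documents = [
    "it is a puppy",
    "it is a kitten",
    "it is a cat",
    "that is a dog and this is a pen",
    "it is a matrix",
]

# Column order as shown in Figure 4-5 (only 7 of the words are displayed there).
vocabulary = ["it", "is", "puppy", "cat", "pen", "a", "this"]

def document_term_matrix(docs, vocab):
    """Stack one row of word counts per document, one column per vocabulary word."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())  # assumed tokenizer: split on whitespace
        rows.append([counts[word] for word in vocab])
    return rows

matrix = document_term_matrix(documents, vocabulary)

# Fraction of zero entries, a simple measure of sparsity.
n_entries = len(matrix) * len(vocabulary)
n_zeros = sum(value == 0 for row in matrix for value in row)
sparsity = n_zeros / n_entries

for doc, row in zip(documents, matrix):
    print(f"{doc:<32}", row)
print("fraction of zeros:", sparsity)
```

A real vocabulary is far larger than seven words, so the fraction of zeros is typically much higher than in this toy case, which is why such matrices are usually stored in a sparse format.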
Feature scaling is essentially a column operation on the data matrix. In particular, ...
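To make the idea of a column operation concrete, the sketch below multiplies every column of a small document-term matrix by its own constant. As one possible choice of constants it uses an inverse document frequency of the form log(N / df), which is an assumption for this example (several idf variants exist); the helper names `idf_factors` and `scale_columns` are likewise hypothetical.

```python
import math

# Counts for the 5 documents and 7 words of Figure 4-5
# (columns: it, is, puppy, cat, pen, a, this).
matrix = [
    [1, 1, 1, 0, 0, 1, 0],
    [1, 1, 0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [0, 2, 0, 0, 1, 2, 1],
    [1, 1, 0, 0, 0, 1, 0],
]

def idf_factors(rows):
    """One factor per column: log(N / df), where df counts documents containing the word."""
    n_docs = len(rows)
    factors = []
    for column in zip(*rows):  # zip(*rows) iterates over columns
        doc_freq = sum(1 for value in column if value > 0)
        factors.append(math.log(n_docs / doc_freq) if doc_freq else 0.0)
    return factors

def scale_columns(rows, factors):
    """Multiply every entry by the factor of the column it sits in."""
    return [[value * f for value, f in zip(row, factors)] for row in rows]

factors = idf_factors(matrix)
scaled = scale_columns(matrix, factors)

for row in scaled:
    print([round(value, 3) for value in row])
```

With this choice of factor, words that occur in every document (here "is" and "a") are scaled to zero, while rarer words keep larger values. Each column is stretched or shrunk by a single constant, and zero entries stay zero.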
ISBN: 9787115509680