精通特征工程

by Alice Zheng, Amanda Casari
April 2019
Intermediate to advanced
172 pages
4h 39m
Chinese
Posts & Telecom Press
Content preview from 精通特征工程
Text Data: Flattening, Filtering, and Chunking
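Bag-of-words is not perfect: breaking a sentence down into single words can destroy its meaning. For example, “not bad” semantically means “decent” or even “good” (especially in British English). But once “not” and “bad” are separated, they express a negation plus a negative sentiment. “toy dog” and “dog toy” are very different things (unless it is a dog toy for a toy dog), yet after splitting them into the single words “toy” and “dog,” both lose their original meaning. It is easy to come up with many more examples like these. Bag-of-n-grams, which we introduce next, alleviates this problem to some extent but is not a fundamental fix. We should keep in mind that bag-of-words is a simple and useful heuristic, but it is still far from a correct semantic understanding of text.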
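
The loss of word order described above can be seen in a minimal sketch. It assumes lowercasing and whitespace tokenization, and uses Python's `collections.Counter` as a stand-in for a bag-of-words; the helper name `bag_of_words` is hypothetical and not taken from the book.

```python
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, ignoring order (hypothetical helper)."""
    # Assumption: lowercase and split on whitespace; real tokenizers do more.
    return Counter(text.lower().split())


toy_dog = bag_of_words("toy dog")
dog_toy = bag_of_words("dog toy")

# Both phrases map to the same bag, so the distinction between them is lost.
print(toy_dog == dog_toy)

# "not bad" becomes two independent counts: a negation and a negative word.
not_bad = bag_of_words("not bad")
print(not_bad)
```

Because a `Counter` stores only counts per word, any two texts made of the same words in a different order are indistinguishable in this representation.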
3.1.2 Bag-of-n-Grams
Bag-of-n-grams is a natural extension of bag-of-words. An n-gram is a sequence of n tokens. A single word is a 1-gram, also known as a unigram. After tokenization, the counting mechanism can either turn individual tokens into word counts or count overlapping sequences as n-grams. For example, the sentence “Emma knocked on the door” generates the 2-grams “Emma knocked,” “knocked on,” “on the,” and “the door.”
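
The overlapping-sequence counting can be sketched in a few lines. This is an illustration only: it assumes whitespace tokenization, and the helper name `ngrams` is hypothetical rather than part of any library.

```python
def ngrams(tokens, n):
    """Return the overlapping n-grams of a token list as space-joined strings."""
    # Slide a window of width n over the tokens, one position at a time.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Assumption: whitespace tokenization is enough for this sentence.
tokens = "Emma knocked on the door".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(bigrams)
# ['Emma knocked', 'knocked on', 'on the', 'the door']
```

A sentence with m tokens yields m - n + 1 overlapping n-grams, so the five-token example produces four 2-grams.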
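n-grams retain more of the original sequence structure of the text, so the bag-of-n-grams representation can be more informative. However, this comes at a cost. In theory, with k unique words there could be k² unique 2-grams (also called bigrams). In practice there are not that many, because not every word can follow every other word. ...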
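
The gap between the theoretical bound and what actually occurs can be checked on a toy example. The three-sentence corpus below is made up for illustration, and lowercasing plus whitespace splitting is an assumed tokenization.

```python
# Hypothetical toy corpus, invented for this illustration.
corpus = [
    "Emma knocked on the door",
    "the door opened",
    "Emma opened the window",
]

words = set()
bigrams = set()
for sentence in corpus:
    tokens = sentence.lower().split()
    words.update(tokens)
    # Pair each token with its successor to form the observed 2-grams.
    bigrams.update(zip(tokens, tokens[1:]))

k = len(words)
print("unique words k:", k)
print("theoretical bound k^2:", k ** 2)
print("observed unique 2-grams:", len(bigrams))
```

Only word pairs that actually appear next to each other in the text are counted, which is why the observed number of 2-grams stays well below k².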

ISBN: 9787115509680