精通特征工程

by Alice Zheng, Amanda Casari
April 2019
Intermediate to advanced
172 pages
4h 39m
Chinese
Posts & Telecom Press
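String Objects: Not as Simple as They Look

String objects come in various encodings, such as ASCII and Unicode. Plain English text can be encoded in ASCII, but most other languages require Unicode. If a document contains non-ASCII characters, make sure the tokenizer can handle the corresponding encoding. Otherwise, the tokenization results will be incorrect.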
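The effect of an encoding mismatch on tokenization can be sketched with Python's standard `re` module. This is a minimal illustration, not the book's own code: the sample sentence and variable names are assumptions chosen for the example, and a simple `\w+` pattern stands in for a real tokenizer.

```python
import re

# Sample text with non-ASCII characters, written with escapes so the
# code points are unambiguous: "Schütze drinks café au lait".
text = "Sch\u00fctze drinks caf\u00e9 au lait"

# In Python 3, \w in a str pattern is Unicode-aware by default,
# so accented letters stay inside their words.
unicode_tokens = re.findall(r"\w+", text)

# Forcing ASCII-only matching splits words at every non-ASCII character.
ascii_tokens = re.findall(r"\w+", text, flags=re.ASCII)

# Raw bytes must be decoded with the right codec before tokenizing.
raw = text.encode("utf-8")
decoded = raw.decode("utf-8")

# Decoding the same bytes as ASCII fails outright.
try:
    raw.decode("ascii")
    ascii_decode_ok = True
except UnicodeDecodeError:
    ascii_decode_ok = False

print(unicode_tokens)
print(ascii_tokens)
print(ascii_decode_ok)
```

The ASCII-only pass silently breaks "Schütze" and "café" into fragments, which is the kind of incorrect result the passage above warns about.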
3.3.2 Collocation Extraction for Phrase Detection
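A sequence of tokens immediately yields the list of words and n-grams. Semantically, however, we are more used to understanding phrases than n-grams. In computational natural language processing (NLP), the concept of a useful phrase is called a collocation. In the words of Manning and Schütze (1999: 151): "A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things."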
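As a minimal sketch of how a token sequence yields words and n-grams, the helper below slides a window of size n over a list of tokens. The function name `ngrams` and the whitespace tokenization are assumptions made for illustration, not the book's implementation.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Naive whitespace tokenization, used here only to keep the sketch short.
tokens = "Emma knocked on the door".lower().split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams)
print(bigrams)
```

Every bigram produced this way is a pair of adjacent tokens, whether or not the pair forms a meaningful phrase.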
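A collocation means more than the sum of its constituent words. For example, "strong tea" means far more than "great physical strength" plus "tea," so it is considered a collocation. The phrase "cute puppy," on the other hand, means exactly the sum of its two words, "cute" and "puppy," so it is not considered a collocation.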
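A collocation does not have to be a consecutive sequence. For example, the sentence "Emma knocked on the door" can be considered to contain the collocation "knock door." Hence, not every collocation is an n-gram. Conversely, not every n-gram is a meaningful collocation.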
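The gap between contiguous n-grams and non-contiguous collocations can be illustrated with a small sketch that pairs content words within a fixed window. The stopword list, the window size, and the function name `candidate_pairs` are all assumptions made for this example; it only generates candidates and does not decide which pairs are real collocations. Reducing "knocked" to "knock" would need a separate lemmatization step, which is omitted here.

```python
# Tiny illustrative stopword list (an assumption, not from the source).
STOPWORDS = {"on", "the", "a", "an", "of", "to"}

def candidate_pairs(tokens, window=3):
    """Pair each content word with later content words at most `window` positions away."""
    pairs = []
    for i, left in enumerate(tokens):
        if left in STOPWORDS:
            continue
        # Look ahead up to `window` positions, staying inside the list.
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            right = tokens[j]
            if right not in STOPWORDS:
                pairs.append((left, right))
    return pairs

tokens = "Emma knocked on the door".lower().split()

# Contiguous bigrams, for comparison.
bigrams = list(zip(tokens, tokens[1:]))
pairs = candidate_pairs(tokens, window=3)

print(bigrams)
print(pairs)
```

The pair ("knocked", "door") shows up among the windowed candidates but not among the contiguous bigrams, which matches the point that a collocation need not be an n-gram.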
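Because a collocation means more than the sum of its constituent words, individual word counts cannot adequately express its meaning. A bag-of-words representation falls short here, and a bag-of-n-grams representation is also problematic, because ...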
ISBN: 9787115509680