精通特征工程

by Alice Zheng, Amanda Casari
April 2019
Intermediate to advanced
172 pages
4h 39m
Chinese
Posts & Telecom Press
3.2.2 Frequency-Based Filtering
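A stopword list is one way to remove words that produce meaningless features. There are also more statistical ways to find such uninformative words. Collocation extraction includes methods that rely on manual definitions as well as methods that use statistics. The same idea applies to word filtering, where frequency statistics can also be used.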
1. Frequent words
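Frequency statistics are a powerful filtering technique: they can filter out corpus-specific common words as well as general-purpose stopwords. For example, the phrase "New York Times" and each of its individual words appear frequently in the New York Times Annotated Corpus dataset. Similarly, the word "house" appears frequently in the phrase "House of Commons" in the Hansard corpus of parliamentary speeches. The Canadian parliamentary debates dataset in this corpus is commonly used for statistical machine translation because it includes English and French versions of all documents. These words are meaningful in general, but not within these particular corpora. A typical stopword list includes general-purpose stopwords but not corpus-specific ones.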
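As a minimal sketch of this idea, the following Python snippet counts how many documents contain each word and flags frequent words that a generic stopword list would miss. The toy corpus, the tiny stopword list, the 75% threshold, and the helper names `tokenize` and `document_frequency` are all assumptions made for illustration; they are not taken from the book.

```python
import re
from collections import Counter

# Hypothetical toy corpus standing in for a real document collection.
docs = [
    "The House of Commons debated the budget today.",
    "Members of the House voted on the housing bill.",
    "The house committee met; the House will vote tomorrow.",
    "A new bill on trade was introduced in the House.",
]

# A deliberately tiny generic stopword list (assumption; real lists are longer).
GENERIC_STOPWORDS = {"the", "of", "on", "a", "in", "was", "will"}


def tokenize(text):
    """Lowercase the text and split it into runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def document_frequency(documents):
    """Count, for each word, the number of documents that contain it."""
    df = Counter()
    for doc in documents:
        # set() ensures a word is counted once per document,
        # no matter how often it repeats inside that document.
        df.update(set(tokenize(doc)))
    return df


df = document_frequency(docs)

# Treat a word as "frequent" if it occurs in at least 75% of the documents
# (an arbitrary illustrative threshold).
min_df = 0.75 * len(docs)
frequent = {word for word, count in df.items() if count >= min_df}

# Frequent words that the generic stopword list does not cover are
# candidates for corpus-specific stopwords.
corpus_specific = sorted(frequent - GENERIC_STOPWORDS)
print(corpus_specific)
```

On this toy corpus, "house" is the only frequent word that the generic list misses, which mirrors the "House of Commons" example. On a real corpus the threshold would need tuning, and each flagged word should be inspected before it is filtered out.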
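Looking at the most frequent words can reveal text-parsing problems and flag normally useful words that happen to appear many times in the corpus. For example, Table 3-1 lists the 40 most frequent words in the Yelp reviews dataset. Here, frequency means the number of documents (reviews) that contain the word, not the number of times it appears within a single document. As we can see, the list contains many stopwords. It also holds some surprises: "s" and "t" are on the list because we used the apostrophe as a tokenization delimiter, so that something like "Mary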
Publisher Resources

ISBN: 9787115509680