精通特征工程
by Alice Zheng, Amanda Casari
April 2019
Intermediate to advanced
172 pages
4h 39m
Chinese
Posts & Telecom Press
Raw text:
"I like puppies. I like cats. I like gobbledygook and zylophant."

Bag-of-words vector with a garbage bin:
I          3
like       3
puppies    1
cats       1
and        1
GARBAGE    2

Figure 3-7: Bag-of-words feature vector with a garbage bin
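Because we do not know which words are rare until the entire corpus has been counted, the garbage bin feature can only be collected in a post-processing step.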
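
The two-pass nature of the garbage bin can be sketched in a few lines of Python. This is only an illustrative sketch: the extra sentence in `extra_docs`, the `MIN_COUNT` threshold, and the helper names `tokenize` and `bow_with_garbage_bin` are assumptions made for the example and do not come from the book. The first pass counts every word in the corpus; only then can the second pass decide which words go into the bin.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase the text and keep alphabetic runs only (a deliberately simple tokenizer).
    return re.findall(r"[a-z]+", text.lower())

# The three sentences from Figure 3-7, plus one hypothetical sentence so that
# "puppies", "cats", and "and" are not rare in this toy corpus.
figure_docs = ["I like puppies.", "I like cats.", "I like gobbledygook and zylophant."]
extra_docs = ["Puppies and cats play."]  # assumption, not from the book
corpus = figure_docs + extra_docs

# Pass 1: count every word over the whole corpus.
corpus_counts = Counter(tok for doc in corpus for tok in tokenize(doc))

MIN_COUNT = 2  # assumed rarity threshold: words seen fewer times go to the bin

def bow_with_garbage_bin(text):
    # Pass 2 (post-processing): rare words are pooled into one GARBAGE feature.
    vector = Counter()
    for tok in tokenize(text):
        key = tok if corpus_counts[tok] >= MIN_COUNT else "GARBAGE"
        vector[key] += 1
    return vector

figure_vector = bow_with_garbage_bin(" ".join(figure_docs))
print(figure_vector)
```

With these assumptions, the vector for the three figure sentences has the same counts as Figure 3-7 (with "i" lowercased by the tokenizer), and "gobbledygook" and "zylophant" share the single GARBAGE entry.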
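Since this book is about feature engineering, our focus is naturally on features. But the concept of rarity also applies to data points. If a text document is very short, it probably contains little useful information and should not be used to train a model. Apply this rule with care, however. The Wikipedia dump corpus contains many pages that are still unfinished, and filtering them out should be safe. Tweets are a different case: they are inherently short and require specialized featurization and modeling techniques.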
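
A length filter of this kind might look like the following sketch. The function name `filter_short_documents`, the whitespace tokenization, and the threshold of five tokens are all hypothetical choices for illustration. As the paragraph above warns, a threshold like this would be unsuitable for inherently short text such as tweets.

```python
def filter_short_documents(docs, min_tokens):
    # Keep only documents with at least `min_tokens` whitespace-separated tokens.
    return [doc for doc in docs if len(doc.split()) >= min_tokens]

# Hypothetical toy documents: one stub and one longer page.
docs = [
    "Stub.",
    "This page describes feature engineering for text data in some detail.",
]

kept = filter_short_documents(docs, min_tokens=5)  # threshold is an assumption
print(kept)
```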
3.2.3 Stemming
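One problem with simple parsing is that different variants of the same word are treated as different words and counted separately. For example, "flower" and "flowers" are technically two different tokens, and the same is true of "swimmer," "swimming," and "swim," even though their meanings are very close. Text parsing would work better if these variants could be mapped to the same word.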
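Stemming is an NLP technique that converts each word to its basic linguistic stem form. There are several approaches to stemming: some are based on linguistic rules, others on observed statistics. One subclass of algorithms combines part-of-speech tagging with linguistic rules, and this process is known as lemmatization.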
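
To make the idea of rule-based stemming concrete, here is a minimal sketch. The `toy_stem` function is a hypothetical two-rule stemmer written only for this example. It is not the Porter algorithm or any library's implementation, and it handles only the kinds of variants mentioned above. A real project would more likely rely on an established stemmer from an NLP library.

```python
from collections import Counter

def toy_stem(word):
    """Hypothetical two-rule stemmer, for illustration only."""
    word = word.lower()
    # Rule 1: drop a plural "s", leaving "ss" endings and very short words alone.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    # Rule 2: drop "ing" or "er" when it follows a doubled consonant,
    # then undouble that consonant ("swimm" becomes "swim").
    for suffix in ("ing", "er"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if len(stem) >= 3 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                return stem[:-1]
    return word

tokens = ["flower", "flowers", "swimmer", "swimming", "swim"]

# Without stemming, every variant is its own token.
raw_counts = Counter(tokens)

# With stemming, variants collapse onto a shared stem.
stem_counts = Counter(toy_stem(t) for t in tokens)
print(stem_counts)  # Counter({'swim': 3, 'flower': 2})
```

The five distinct tokens collapse to two stems, which is the effect the paragraph above describes. The rules are far too narrow for real text; for instance, they leave "walking" untouched.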
Most stemming tools focus on English, but tools for other languages are also developing rapidly ...

ISBN: 9787115509680