精通特征工程
by Alice Zheng, Amanda Casari
April 2019
Intermediate to advanced
172 pages
4h 39m
Chinese
Posts & Telecom Press
Text Data: Flattening, Filtering, and Chunking
u'flower'
>>> stemmer.stem('zeroes')
u'zero'
>>> stemmer.stem('stemmer')
u'stem'
>>> stemmer.stem('sixties')
u'sixti'
>>> stemmer.stem('sixty')
u'sixty'
>>> stemmer.stem('goes')
u'goe'
>>> stemmer.stem('go')
u'go'
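Stemming does have a computational cost, and whether the eventual benefit outweighs that cost depends on the application. It is also worth noting that stemming can do more harm than good. The words "new" and "news" have very different meanings, yet both are stemmed to "new". There are many similar examples. For this reason, stemming is not a mandatory step.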
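The "new"/"news" problem can be made concrete with a small sketch. The `naive_stem` function below is a hypothetical, deliberately simplistic suffix stripper written for illustration only; it is not the Porter algorithm and not the stemmer used in the transcript above. It only shows how one rule can produce both a helpful merge and a harmful one.

```python
from collections import defaultdict


def naive_stem(word):
    """Toy stemmer (an assumption for illustration, not Porter):
    strip one trailing 's' from words longer than three letters,
    leaving words that end in 'ss' untouched."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def collisions(words, stem=naive_stem):
    """Group words by stem and keep only stems shared by several words."""
    groups = defaultdict(set)
    for w in words:
        groups[stem(w)].add(w)
    return {s: sorted(ws) for s, ws in groups.items() if len(ws) > 1}


vocab = ["new", "news", "flower", "flowers", "class"]
merged = collisions(vocab)
print(merged)
# {'new': ['new', 'news'], 'flower': ['flower', 'flowers']}
```

Merging "flower" and "flowers" is the intended effect, while merging "new" and "news" discards a real difference in meaning. The rule cannot tell the two cases apart, which is one reason to check on the actual task whether stemming helps.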
3.3 Units of Meaning: From Words to n-Grams to Phrases
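The bag-of-words concept is easy to understand, but how does a computer know what a word is? The digital representation of a text document is a string, that is, a sequence of characters. We may also encounter semi-structured documents, such as JSON strings or HTML pages. But even with added markup and structure, the basic unit of text is still the string. How do we turn a string into a sequence of words? This requires parsing and tokenization, which we discuss next.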
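As a minimal sketch of the string-to-words step, the snippet below splits a string into lowercase word tokens with a regular expression and counts them into a bag of words. The pattern, the `tokenize` name, and the choice to lowercase are assumptions made for this illustration; real tokenizers handle many more cases (hyphens, URLs, non-Latin scripts).

```python
import re
from collections import Counter

# Assumed token definition: a run of letters or digits, optionally
# followed by an apostrophe and more letters (so "doesn't" stays whole).
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?")


def tokenize(text):
    """Return the lowercase word tokens found in a string."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]


tokens = tokenize("The stemmer doesn't know that news isn't new.")
print(tokens)
# ['the', 'stemmer', "doesn't", 'know', 'that', 'news', "isn't", 'new']

# A bag of words is then just the token counts.
bag = Counter(tokens)
print(bag["news"], bag["new"])
# 1 1
```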
3.3.1 Parsing and Tokenization
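Parsing is necessary when the string contains more than plain text. For example, if the raw data is a web page, an email, or some kind of log, it contains additional structure. We need to decide how to handle the markup, the headers and footers, and the parts of the log we are not interested in. If the document is a web page, the parser also needs to handle ...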
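One way to handle markup in a web page, sketched here with Python's standard `html.parser` module, is to keep only the visible text and drop the contents of `<script>` and `<style>` elements. The `TextExtractor` class, the `html_to_text` helper, and the sample page are hypothetical names and data introduced for this example; which parts of a page count as uninteresting is an application-level decision.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []       # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>News</h1><p>Stemming is <b>optional</b></p></body></html>"
)
text = html_to_text(page)
print(text)
# News Stemming is optional
```

The resulting plain string can then be passed to whatever tokenizer the application uses.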
ISBN: 9787115509680