精通特征工程

by Alice Zheng, Amanda Casari
April 2019
Intermediate to advanced
172 pages
4h 39m
Chinese
Posts & Telecom Press
('as', 'RB'),
('awful', 'JJ'),
('as', 'IN'),
('trying', 'VBG'),
('to', 'TO'),
('find', 'VB'),
('a', 'DT'),
('date', 'NN')]
>>> print([np for np in blob_df[4].noun_phrases])
['got', 'goldberg', 'arizona', 'new position', 'june', 'new doctor', 'nyc']
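As you can see, the noun phrases found by the different libraries are not exactly the same. The phrases found by spaCy include common English words such as “a” and “the”, whereas TextBlob strips those words out. This shows that the rule engines the libraries use to decide what counts as a noun phrase differ. You can also write your own part-of-speech relations to define the chunks you are looking for. See Bird et al. (2009) for an in-depth look at chunking with Python from scratch.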
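To make the idea of writing your own part-of-speech relations concrete, here is a minimal hand-rolled sketch. It is not how spaCy or TextBlob work internally; it only shows that a chunk can be defined as a pattern over tags. The function name `chunk_noun_phrases`, the `keep_determiners` flag, and the pattern `DT? JJ* NN+` are all assumptions made for illustration. The input reuses the tagged tokens shown in the output above.

```python
# Hand-rolled sketch: group pre-tagged tokens into noun-phrase chunks using
# the pattern  DT? JJ* NN+  (optional determiner, adjectives, then nouns).
# The function name, flag, and pattern are illustrative assumptions.

def chunk_noun_phrases(tagged_tokens, keep_determiners=True):
    """Return noun phrases found in a list of (word, tag) pairs."""
    phrases = []
    i, n = 0, len(tagged_tokens)
    while i < n:
        start = i
        # Optional leading determiner, e.g. "a" or "the".
        if tagged_tokens[i][1] == "DT":
            i += 1
        # Any number of adjectives (JJ, JJR, JJS).
        while i < n and tagged_tokens[i][1].startswith("JJ"):
            i += 1
        # One or more nouns (NN, NNS, NNP, NNPS).
        noun_start = i
        while i < n and tagged_tokens[i][1].startswith("NN"):
            i += 1
        if i > noun_start:
            words = [word for word, tag in tagged_tokens[start:i]
                     if keep_determiners or tag != "DT"]
            phrases.append(" ".join(words))
        else:
            # No noun followed, so the pattern failed here; move on by one token.
            i = start + 1
    return phrases


tagged = [("as", "RB"), ("awful", "JJ"), ("as", "IN"), ("trying", "VBG"),
          ("to", "TO"), ("find", "VB"), ("a", "DT"), ("date", "NN")]

with_dt = chunk_noun_phrases(tagged)
without_dt = chunk_noun_phrases(tagged, keep_determiners=False)
print(with_dt)     # ['a date']
print(without_dt)  # ['date']
```

Changing one rule (whether determiners are kept) changes the phrases returned, which is the kind of difference described above. For real work, a library chunker such as NLTK's `RegexpParser` is likely a better starting point than this sketch.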
3.4 Summary
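The bag-of-words representation is simple to understand, easy to compute, and very effective for classification and search tasks. But sometimes single words are too simple to capture some of the information in the text. To address this, we turn to longer sequences. Bag-of-n-grams is a natural generalization of bag-of-words; the concept is just as easy to grasp, and it is as easy to compute as bag-of-words.

Bag-of-n-grams produces a large number of distinct n-grams. This raises the cost of storing features and demands more computation during model training and prediction. For the same number of data points, bag-of-n-grams greatly increases the dimensionality of the feature space, so the data becomes very sparse. The larger n is, the higher the storage and computation cost ...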
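The growth in distinct n-grams and the resulting sparsity can be checked on a toy example. The sketch below uses only the standard library. The three-sentence corpus, the whitespace tokenizer, and the helper names `ngrams` and `ngram_stats` are assumptions made for illustration; they do not come from the book.

```python
# Sketch: count distinct n-grams and measure how sparse the
# document-by-feature matrix becomes as n grows.
# Corpus, tokenizer, and helper names are illustrative assumptions.

def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_stats(docs, n):
    """Return (vocabulary size, fraction of nonzero matrix entries)."""
    per_doc = [set(ngrams(doc.lower().split(), n)) for doc in docs]
    vocabulary = set().union(*per_doc)
    # Each distinct n-gram in a document is one nonzero cell in its row.
    nonzero = sum(len(doc_ngrams) for doc_ngrams in per_doc)
    density = nonzero / (len(docs) * len(vocabulary))
    return len(vocabulary), density


docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat saw the dog on the mat",
]

stats = {n: ngram_stats(docs, n) for n in (1, 2, 3)}
for n, (vocab_size, density) in stats.items():
    print(f"n={n}: {vocab_size} distinct n-grams, density={density:.2f}")
```

Even with three short sentences, the vocabulary grows and the share of nonzero entries falls as n increases. On a real corpus the effect is far stronger.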

ISBN: 9787115509680