Python文本分析

by Jens Albrecht, Sidharth Ramachandran, Christian Winkler
August 2022
Intermediate to advanced
441 pages
11h 26m
Chinese
China Electric Power Press Ltd.
10.5.1 Case Study: Dimensionality Reduction
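We can visualize high-dimensional vectors by projecting the data into two or three dimensions. If the projection works well, clusters of related words become visible at a glance, and we gain a deeper understanding of the semantic concepts in the corpus. Using the model with a window size of 30 (which favors syntagmatic relations), we will look for clusters of related words and explore the semantic neighborhood of specific keywords. We therefore expect the words related to BMW to form a "BMW" cluster, the words related to Toyota to form a "Toyota" cluster, and so on.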
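Dimensionality reduction has many uses in machine learning. Some learning algorithms struggle with high-dimensional and often sparse data. Dimensionality reduction techniques such as PCA, t-SNE, and UMAP (see the sidebar "Dimensionality Reduction Techniques" below) try to preserve or even highlight important aspects of the data distribution in the projection. The basic idea is to project the data so that objects close to each other in the high-dimensional space remain close in the projection, while distant objects stay far apart.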
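The idea that near neighbors should stay near can be checked numerically. The following is a minimal sketch, not taken from the book: the helper names `knn_indices` and `neighbor_overlap` are hypothetical, the data is synthetic rather than real word vectors, and the "projection" simply keeps the first two coordinates so that the example needs only NumPy.

```python
import numpy as np

def knn_indices(points: np.ndarray, k: int) -> np.ndarray:
    """Return, for every row, the indices of its k nearest neighbors (excluding itself)."""
    # Pairwise squared Euclidean distances via broadcasting.
    diffs = points[:, None, :] - points[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    return np.argsort(dists, axis=1)[:, :k]

def neighbor_overlap(high: np.ndarray, low: np.ndarray, k: int = 5) -> float:
    """Average share of each point's k nearest neighbors that survive the projection."""
    nn_high = knn_indices(high, k)
    nn_low = knn_indices(low, k)
    shared = [len(set(a) & set(b)) for a, b in zip(nn_high, nn_low)]
    return float(np.mean(shared)) / k

rng = np.random.default_rng(0)
# Synthetic stand-in for embeddings: two well-separated groups in 50 dimensions.
group_a = rng.normal(loc=0.0, scale=1.0, size=(30, 50))
group_b = rng.normal(loc=8.0, scale=1.0, size=(30, 50))
high_dim = np.vstack([group_a, group_b])

# A deliberately naive projection (assumption for illustration): keep two coordinates.
low_dim = high_dim[:, :2]

score = neighbor_overlap(high_dim, low_dim, k=5)
print(f"neighbor overlap: {score:.2f}")
```

A score near 1 would mean that local neighborhoods are largely preserved. A naive projection like this one typically scores much lower, which is the gap that dedicated techniques try to close.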
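In this example we use the UMAP algorithm because it produced the best visualizations. However, because the umap library implements the scikit-learn estimator interface, you can just as easily replace UMAP with scikit-learn's PCA or TSNE classes.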
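Because all of these reducers share the `fit_transform` method, swapping one for another can be sketched as follows. This is an illustrative example rather than the book's code: `project_2d` is a hypothetical helper, the vectors are random stand-ins for word embeddings, and the parameter values are arbitrary choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def project_2d(vectors, reducer):
    """Project vectors to 2D with any object that offers fit_transform()."""
    return reducer.fit_transform(vectors)

rng = np.random.default_rng(42)
# Synthetic stand-in for word vectors: 60 "words" with 100 dimensions each.
vectors = rng.normal(size=(60, 100))

coords_pca = project_2d(vectors, PCA(n_components=2))
coords_tsne = project_2d(
    vectors, TSNE(n_components=2, perplexity=10, init="pca", random_state=42)
)

# With the umap-learn package installed, the same call should work, e.g.:
# from umap import UMAP
# coords_umap = project_2d(vectors, UMAP(n_components=2, random_state=42))

print(coords_pca.shape, coords_tsne.shape)
```

The resulting 2D coordinates can then be passed to any plotting library, with each point labeled by its word.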
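Dimensionality Reduction Techniques
There are many algorithms for dimensionality reduction. Those commonly used for visualization include PCA, t-SNE, and UMAP.
Principal component analysis (PCA) performs a linear projection of the data that preserves most of the variance in the data points. Mathematically, it is based on the eigenvectors of the covariance matrix with the largest eigenvalues (the principal components). PCA considers only the global data distribution. It ignores local structure, and all data points are transformed in the same way. Apart from ...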
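The description of PCA above (eigenvectors of the covariance matrix with the largest eigenvalues) can be made concrete with a minimal NumPy sketch. The function name `pca_project` and the synthetic data are assumptions made for illustration. In practice you would more likely use a library implementation such as scikit-learn's PCA.

```python
import numpy as np

def pca_project(data: np.ndarray, n_components: int = 2):
    """Project data onto the eigenvectors of its covariance matrix
    that have the largest eigenvalues (the principal components)."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    # eigh is suited to symmetric matrices; eigenvalues come back in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]
    # Share of the total variance captured by each kept component.
    explained = eigenvalues[order] / eigenvalues.sum()
    # The same linear map is applied to every data point.
    return centered @ components, explained

rng = np.random.default_rng(1)
# Synthetic data whose variance is concentrated in the first two dimensions.
scales = np.array([10, 5, 1, 1, 1, 1, 1, 1, 1, 1])
data = rng.normal(size=(200, 10)) * scales

projected, explained = pca_project(data, n_components=2)
print(projected.shape, explained.round(2))
```

Because the projection is a single matrix multiplication, every point is transformed in the same way, which is why PCA captures the global distribution but not local structure.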

ISBN: 9787519864446