Spark机器学习实战

by Posts & Telecom Press, Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, Shuen Mei
May 2024
Beginner to intermediate
549 pages
8h 11m
Chinese
Packt Publishing

Chapter 12 Implementing Text Analytics with the Spark 2.0 ML Library

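In this chapter, we cover the following topics:

  • Counting word frequencies with Spark;
  • Finding similar words with Spark and Word2Vec;
  • Downloading the complete Wikipedia corpus to build a real-world Spark machine learning project;
  • Text analytics with Spark 2.0 and latent semantic analysis;
  • Topic modeling with Spark 2.0 and latent Dirichlet allocation.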

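Text analytics sits at the intersection of machine learning, mathematics, linguistics, and natural language processing. Text analytics (called text mining in older literature) attempts to extract information from unstructured and semi-structured data and to infer higher-level concepts, sentiment, and semantic details. Notably, traditional keyword search cannot cope effectively with noisy, ambiguous, and irrelevant tokens and concepts, all of which must be filtered out according to the actual context.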
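As a minimal sketch of the two ideas above (counting word frequencies and filtering out noise tokens), the following plain-Python example tokenizes a few lines, drops stop words, and counts what remains. It does not use any Spark API; it only mirrors, on a single machine, the split, filter, and count-by-key shape that a distributed word count typically follows. The names `STOP_WORDS`, `tokenize`, and `word_frequencies`, the tiny stop-word list, and the sample sentences are all hypothetical and chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption for illustration only).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}


def tokenize(line):
    """Lowercase a line and return its runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", line.lower())


def word_frequencies(lines, stop_words=STOP_WORDS):
    """Count tokens across all lines, skipping any token in stop_words."""
    counts = Counter()
    for line in lines:
        counts.update(t for t in tokenize(line) if t not in stop_words)
    return counts


# Made-up sample input, not taken from the book.
sample = [
    "The cat sat in the hat.",
    "A cat is a cat, and the hat is a hat.",
]
freq = word_frequencies(sample)
for word, count in sorted(freq.items()):
    print(word, count)
```

A real pipeline would need a far larger stop-word list and a tokenizer suited to the language of the corpus; this sketch only shows why filtering has to happen before the counts mean anything.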

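Fundamentally, the goal is, for a given set of documents (text, tweets, web pages, and social media), to determine the gist of each document and the concepts it is trying to convey (topics and concepts). Merely breaking a document into parts and categories is too primitive an approach to count as text analytics. We can do better.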

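Spark provides a set of tools and facilities that simplify text analytics, and users can combine these techniques to build a viable system (for example, combining a KNN model with a topic model).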

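It is worth mentioning that many commercial systems today offer a combination of techniques to arrive at the final answer. Although Spark has many toolsets suited to large-scale data, it is not hard to imagine that any text analytics system could also make use of graph models (such as GraphFrame or GraphX). Figure 12-1 summarizes the tools and facilities that Spark provides for text analytics.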

Figure 12-1

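Text analytics is an important area for the future, with significant applications in security, customer engagement, sentiment analysis, social media, and online learning, as shown in Figure 12-2. With text analytics techniques, traditional data stores (structured data and database tables) can be combined with unstructured data (customer reviews, sentiment, and social media interactions) to reach a higher order of understanding and a more complete view of the business unit, which was not possible before. This matters especially in a new era in which social media and unstructured text are the primary means of communication. ...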


Publisher Resources

ISBN: 9781836201830