Spark机器学习实战

by Posts & Telecom Press, Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, Shuen Mei
May 2024
Beginner to intermediate
549 pages
8h 11m
Chinese
Packt Publishing
Content preview from Spark机器学习实战

Chapter 8 Unsupervised Clustering Algorithms in Spark 2.0

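In this chapter, we will cover the following topics:

  • Building a KMeans classification system with Spark 2.0;
  • Bisecting KMeans, a new algorithm in Spark 2.0;
  • Classifying data with Gaussian mixture and expectation maximization (EM) in Spark 2.0;
  • Classifying the vertices of a graph with power iteration clustering (PIC) in Spark 2.0;
  • Using latent Dirichlet allocation (LDA) to classify documents and text into topics;
  • Using Streaming KMeans to classify data in near real time.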

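Unsupervised machine learning is a technique that tries to draw inferences, directly or indirectly (through latent factors), from a set of unlabeled observations. Put simply, it tries to discover hidden knowledge or structure in a dataset without requiring the training data to be labeled.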

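When applied to large datasets (iterative, multi-pass computation with many intermediate writes), most machine learning libraries break down. The Apache Spark machine learning library is designed for parallelism and very large datasets: it keeps intermediate data in memory and can therefore handle large datasets.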

At a more abstract level, unsupervised learning can be divided into the following parts.

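  • Clustering systems: partition the input data into categories using either hard assignment (each sample belongs to exactly one cluster) or soft assignment (each sample gets membership probabilities and can belong to several categories at once).
  • Dimensionality reduction systems: use a dense representation of the original data to discover its latent factors.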
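
The difference between hard and soft assignment can be illustrated with a small sketch in plain Python. This is not Spark code and not taken from the book: the centers, the sample point, and the helper names `hard_assign` and `soft_assign` are hypothetical, and the soft memberships assume equal-weight spherical Gaussians with unit variance, which is only one of many possible choices.

```python
import math

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def hard_assign(point, centers):
    """Hard assignment: index of the single nearest center."""
    return min(range(len(centers)),
               key=lambda k: squared_distance(point, centers[k]))

def soft_assign(point, centers):
    """Soft assignment: one membership probability per center.

    Assumption: equal-weight spherical Gaussians with unit variance, so each
    probability is proportional to exp(-0.5 * squared distance).
    """
    scores = [-0.5 * squared_distance(point, c) for c in centers]
    top = max(scores)  # subtract the max before exponentiating for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

centers = [(0.0, 0.0), (4.0, 0.0)]  # hypothetical cluster centers
sample = (1.0, 0.0)                 # hypothetical observation

hard_label = hard_assign(sample, centers)   # a single cluster index
memberships = soft_assign(sample, centers)  # probabilities that sum to 1
print(hard_label, [round(m, 3) for m in memberships])
```

The hard label discards how close the sample is to the other center, whereas the soft memberships keep that information as a probability for every cluster.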

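Figure 8-1 shows the overall landscape of machine learning techniques. The previous chapters focused on supervised machine learning; this chapter focuses on unsupervised machine learning with the Spark ML/MLlib libraries, covering clustering and latent factor models.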

Figure 8-1

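Clusters are usually modeled with a measure of intra-cluster similarity, such as Euclidean distance or probability. Spark provides a complete, high-performance set of algorithms that can be parallelized at scale. Besides the APIs, Spark also ships the full source code, which helps developers understand performance bottlenecks and address custom requirements (for example, porting to GPUs). ...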
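
As a rough illustration of a Euclidean intra-cluster measure, the sketch below computes the within-cluster sum of squared distances, the kind of cost that KMeans-style algorithms try to minimize. It is plain Python rather than Spark code, and the data points, the candidate centers, and the function name `within_cluster_sse` are made up for the example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def within_cluster_sse(points, centers):
    """Sum, over all points, of the squared distance to the nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)

# Hypothetical data: two well-separated groups of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

two_centers = [(0.0, 1.0), (10.0, 1.0)]  # one center per group
one_center = [(5.0, 1.0)]                # a single center for everything

cost_two = within_cluster_sse(points, two_centers)
cost_one = within_cluster_sse(points, one_center)
print(cost_two, cost_one)
```

A lower cost means that points sit closer to their assigned center, so the two-center model describes this data far better than the single-center one.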


Publisher Resources

ISBN: 9781836201830