Spark机器学习实战

by Posts & Telecom Press, Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, Shuen Mei
May 2024
Beginner to intermediate
549 pages
8h 11m
Chinese
Packt Publishing

Chapter 11: The Curse of Dimensionality in Big Data

In this chapter, we will cover the following:

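  • Two methods for ingesting and preparing a CSV file for processing in Spark;
  • Using Singular Value Decomposition (SVD) to reduce high-dimensional data in Spark;
  • Using Principal Component Analysis (PCA) to pick the most effective latent factors for machine learning in Spark.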

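The curse of dimensionality is not a new term or concept. It was first coined by R. Bellman while he was working on dynamic programming problems (the Bellman equation). In machine learning, it refers to the drop in prediction accuracy that occurs when the number of dimensions (axes or features) increases while the number of training samples stays the same (or shrinks in relative terms). This is also known as the Hughes effect, named after G. Hughes, and it describes how the search space grows rapidly (exponentially) as more and more dimensions are introduced into the problem space. The result is somewhat counterintuitive, but it holds in practice: if the number of samples does not grow at the same rate as the number of dimensions, the resulting model is less accurate.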

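In short, most machine learning algorithms are statistical at heart: they try to learn the properties of the target space by partitioning it during training and keeping some form of count of how many instances of each class fall in each subspace. The curse of dimensionality arises because, as dimensions are added, fewer and fewer data samples are available to help the algorithm discriminate and learn. As a rule of thumb, if N samples are available in one dimension, then N^D samples are needed in D dimensions to keep the sample density constant.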
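The N^D rule of thumb above can be made concrete with a few lines of Python. This is only an illustrative sketch: the function name `samples_needed` and the choice of 10 samples per axis are assumptions made for the example, not something taken from the book.

```python
def samples_needed(n_per_axis: int, dims: int) -> int:
    """Samples required in `dims` dimensions to keep the same density
    as `n_per_axis` samples along a single axis (the N^D rule of thumb)."""
    if n_per_axis < 1 or dims < 1:
        raise ValueError("n_per_axis and dims must be positive integers")
    return n_per_axis ** dims


if __name__ == "__main__":
    # Hypothetical starting point: 10 samples along one axis.
    for dims in (1, 2, 3, 6):
        print(f"D={dims}: {samples_needed(10, dims)} samples")
```

With 10 samples per axis, six dimensions already call for a million samples, which is why adding features without adding data thins out the training set so quickly.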

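For example, take data for 10 patients measured in two dimensions (height and weight), which gives 10 data points on a two-dimensional plane. What happens if we introduce other dimensions such as region, calorie intake, ethnicity, and income? We still have only 10 observations (10 patients), but they now sit in a much larger six-dimensional space. The inability of the sample (training) data to grow exponentially as new dimensions are introduced is what is called the curse of dimensionality.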

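A graphical example shows how the search space grows relative to the data samples. Figure 11-1 shows a set of five data points on a 5×5 grid (25 cells). What happens to prediction accuracy when another dimension is added? The three-dimensional space has 125 cells but still only five data points, which leaves a large number of sparse subspaces that do not help a machine learning algorithm learn (or discriminate) any better, so its accuracy drops. ...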
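The grid example can be checked numerically. The sketch below places five random points in the unit cube, splits each axis into five bins, and reports what fraction of the cells contain at least one point. The helper `occupied_fraction`, the fixed seed, and the use of uniformly random points are assumptions made for illustration; the book's figure uses its own data.

```python
import random


def occupied_fraction(points, bins_per_axis):
    """Fraction of grid cells that contain at least one point.

    Each point is a sequence of coordinates in [0, 1); every axis is
    split into `bins_per_axis` equal-width bins.
    """
    dims = len(points[0])
    occupied = {
        tuple(min(int(x * bins_per_axis), bins_per_axis - 1) for x in p)
        for p in points
    }
    return len(occupied) / bins_per_axis ** dims


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs match
    n_points, bins = 5, 5
    for dims in (2, 3, 6):
        pts = [[rng.random() for _ in range(dims)] for _ in range(n_points)]
        frac = occupied_fraction(pts, bins)
        upper = n_points / bins ** dims  # best case: every point in its own cell
        print(f"D={dims}: cells={bins ** dims}, occupied={frac:.4%}, at most {upper:.4%}")
```

Whatever the random draw, five points can occupy at most 5 of 25 cells (20%) in two dimensions and at most 5 of 125 cells (4%) in three, so the share of empty cells grows with every added axis.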


Publisher Resources

ISBN: 9781836201830