© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
A. Testas, Distributed Machine Learning with PySpark, https://doi.org/10.1007/978-1-4842-9751-3_15

15. k-Means Clustering with Pandas, Scikit-Learn, and PySpark

Abdelaziz Testas
Fremont, CA, USA

In this chapter, we delve into the process of building, training, and evaluating a k-means clustering algorithm for effective data segmentation. Clustering is a commonly used technique in segmentation analysis to group similar observations together based on their characteristics or their proximity in the feature space. The result is a set of clusters, with each observation assigned to a specific cluster. By organizing data into clusters, we can gain a deeper understanding ...
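As a rough illustration of the idea described above (grouping observations by proximity in feature space and assigning each one to a cluster), here is a minimal sketch using Scikit-Learn's `KMeans`. The data points, the choice of two clusters, and the `random_state` value are assumptions made up for this example and are not taken from the chapter.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D observations forming two clearly separated groups (illustrative only).
X = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
    [8.0, 8.2], [7.9, 8.1], [8.3, 7.8],
])

# k=2 is an assumption for this toy data; random_state makes the run repeatable.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)      # one cluster label per observation
centers = model.cluster_centers_   # one centroid per cluster

for point, label in zip(X, labels):
    print(point, "-> cluster", label)
print("centroids:")
print(centers)
```

Each observation receives exactly one label, and the first three points end up in a different cluster from the last three. Which group is numbered 0 or 1 is arbitrary.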
