Programming in PySpark

This section provides a quick introduction to programming with Python in Spark, starting with the basic data structures.

The Resilient Distributed Dataset (RDD) is the primary data structure in Spark. It is a distributed collection of objects and has the following three main features:

  • Resilient: When any node fails, affected partitions will be reassigned to healthy nodes, which makes Spark fault-tolerant
  • Distributed: Data resides on one or more nodes in a cluster, which can be operated on in parallel
  • Dataset: An RDD holds a collection of partitioned data, together with their values or metadata

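To make this concrete, the following is a minimal sketch (not taken from the book's code bundle) of creating an RDD from a local Python list with SparkContext, splitting it into partitions, and running a transformation and an action on it; the application name and data are placeholders.

    from pyspark import SparkContext

    sc = SparkContext(appName='RDDIntro')                 # entry point for RDD operations

    # Distribute a local list across 2 partitions
    numbers = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)

    squares = numbers.map(lambda x: x * x)                # transformation: evaluated lazily
    print(squares.collect())                              # action: [1, 4, 9, 16, 25]
    print(numbers.getNumPartitions())                     # 2

    sc.stop()

Note that transformations such as map() stay lazy; only an action such as collect() triggers computation and pulls the partitioned results back to the driver.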
RDD was the main data structure in Spark before version 2.0. Since then, it has been superseded by the DataFrame, which is also ...
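As a rough illustration only (the column names and rows below are made up, not taken from the book), a DataFrame is created through SparkSession, the unified entry point introduced in Spark 2.0:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('DataFrameIntro').getOrCreate()

    # Build a small DataFrame with explicit column names
    df = spark.createDataFrame([(1, 'machine'), (2, 'learning')], ['id', 'word'])
    df.show()           # prints the rows as a table
    df.printSchema()    # prints column names and types

    spark.stop()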
