Python Machine Learning By Example - Second Edition

by Yuxi (Hayden) Liu
February 2019
Beginner to intermediate
382 pages
10h 1m
English
Packt Publishing

Programming in PySpark

This section provides a quick introduction to programming with Python in Spark. We will start with the basic data structures in Spark.

A Resilient Distributed Dataset (RDD) is the primary data structure in Spark. It is a distributed collection of objects and has the following three main features:

  • Resilient: When any node fails, affected partitions will be reassigned to healthy nodes, which makes Spark fault-tolerant
  • Distributed: Data resides on one or more nodes in a cluster and can be operated on in parallel
  • Dataset: This contains a collection of partitioned data with their values or metadata
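
The three features above can be illustrated with a small, pure-Python toy model. This is only a sketch for intuition and is not Spark's implementation or API: the `ToyRDD` class, its methods, and the round-robin partitioning are all hypothetical names and choices made up for this example, and threads stand in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyRDD:
    """Hypothetical stand-in for an RDD: partitioned data plus the recipe to rebuild it."""

    def __init__(self, source, num_partitions, transforms=()):
        self.source = list(source)
        self.num_partitions = num_partitions
        self.transforms = tuple(transforms)

    def _source_partition(self, index):
        # Dataset: the data is split into partitions (round-robin in this toy).
        return self.source[index::self.num_partitions]

    def map(self, func):
        # A transformation is only recorded here; nothing is computed yet.
        return ToyRDD(self.source, self.num_partitions, self.transforms + (func,))

    def compute_partition(self, index):
        # Resilient: any single partition can be rebuilt from the source
        # data and the recorded transformations.
        data = self._source_partition(index)
        for func in self.transforms:
            data = [func(x) for x in data]
        return data

    def collect(self):
        # Distributed: partitions are processed independently, in parallel
        # (threads play the role of worker nodes).
        with ThreadPoolExecutor(max_workers=self.num_partitions) as pool:
            return list(pool.map(self.compute_partition, range(self.num_partitions)))


rdd = ToyRDD(range(1, 9), num_partitions=4).map(lambda x: x * x)
partitions = rdd.collect()

# Simulate losing one partition on a failed node, then rebuilding only that one.
lost_index = 2
partitions[lost_index] = None
partitions[lost_index] = rdd.compute_partition(lost_index)
```

In this sketch only the lost partition is recomputed, while the other three are left untouched, which is the intuition behind the fault tolerance described in the list.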

The RDD was the main data structure in Spark before version 2.0. Since then, it has been replaced by the DataFrame, which is also ...

ISBN: 9781789616729