Introduction to PySpark
Python is one of the most popular general-purpose programming languages, with a number of features well suited to data processing and machine learning tasks. PySpark was initially developed as a lightweight Python frontend to Apache Spark, so that Python programs can use Spark's distributed computation engine. In this chapter, we will discuss a few technical aspects of using Spark from a Python IDE such as PyCharm.
Many data scientists use Python because it offers a rich variety of numerical libraries for statistics, machine learning, and optimization. However, processing large-scale datasets in Python is usually tedious because the runtime is single-threaded. As a result, data that fits in the main memory can ...