The skew problem
Distributed systems, like teams of people working on a shared activity, perform best when the work is evenly distributed among all the members of the team or the cluster. Both suffer when the work is unevenly distributed, because the system performs only as fast as its slowest component.
In the case of Spark, data is distributed across the cluster. You might have come across cases where a map job runs fairly quickly but your joins or shuffles take an excessive amount of time. In most real-life cases you will have popular keys or null values in your data, which results in some tasks getting more work than others and therefore in a skewed system. In the database world, original keys would actually be used to create new keys ...
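To make the effect of a popular key concrete, here is a minimal sketch in plain Python rather than Spark. It assumes a simple hash partitioner, with `zlib.crc32` standing in for whatever hash a real shuffle would use. The function names `partition_sizes` and `skew_ratio`, the key names, and the record counts are all hypothetical and chosen only for illustration.

```python
from collections import Counter
import zlib


def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition under hash partitioning."""
    sizes = Counter()
    for key in keys:
        # crc32 is a deterministic stand-in for a shuffle partitioner's hash
        # (an assumption for this sketch, not Spark's actual hash function).
        sizes[zlib.crc32(str(key).encode()) % num_partitions] += 1
    return [sizes[p] for p in range(num_partitions)]


def skew_ratio(sizes):
    """Largest partition divided by the mean partition size (1.0 means even)."""
    return max(sizes) / (sum(sizes) / len(sizes))


# Hypothetical data: 1,000 distinct keys with one record each ...
even_keys = [f"user_{i}" for i in range(1000)]
# ... and the same keys plus one popular key that appears 9,000 times.
skewed_keys = even_keys + ["popular"] * 9000

even = partition_sizes(even_keys, 8)
skewed = partition_sizes(skewed_keys, 8)

print("even:  ", even, round(skew_ratio(even), 2))
print("skewed:", skewed, round(skew_ratio(skewed), 2))
```

Every record with the same key hashes to the same partition, so in the skewed case one of the eight partitions holds at least 9,000 of the 10,000 records. The task that processes that partition does most of the work while the others sit idle, which is the "slowest component" effect described above.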