Chapter 5. Dataproc on Google Kubernetes Engine
Kubernetes is a widely adopted open source platform that automates deploying, scaling, and operating containerized applications. With Kubernetes, developers can deploy complex applications more quickly and ensure high availability, fault tolerance, and better resource utilization across their infrastructure.
In standard Dataproc clusters running on GCE, the YARN framework handles resource management. Like YARN, Kubernetes functions as a resource manager, capable of allocating resources to meet framework needs. However, while YARN is designed specifically for the Hadoop ecosystem, Kubernetes takes a more general-purpose approach to container orchestration, extending its capabilities beyond big data workloads.
Deploying Dataproc on GCE requires managing the underlying infrastructure, including VMs and networking. In contrast, Dataproc on GKE uses the existing infrastructure of a GKE cluster, simplifying deployment and management. A Dataproc on GKE cluster is essentially a virtual cluster: you define which node pools it uses without directly managing the underlying hardware.
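To make the idea of a virtual cluster concrete, the following minimal sketch assembles the kind of request body that maps a Dataproc cluster onto node pools of an existing GKE cluster. It uses only the Python standard library and does not call any Google Cloud API. The field names follow our understanding of the Dataproc REST `Cluster` resource (`virtualClusterConfig`, `gkeClusterTarget`, `nodePoolTarget`) and should be checked against the current API reference. The helper name `build_virtual_cluster` and every project, cluster, pool, and bucket name and the Spark version string are hypothetical placeholders.

```python
import json


def build_virtual_cluster(project, region, gke_cluster, dataproc_cluster,
                          node_pools, staging_bucket, spark_version):
    """Return a dict describing a Dataproc virtual cluster on GKE.

    node_pools maps a GKE node pool name to the list of roles it serves.
    Field names are assumed from the Dataproc REST Cluster resource;
    verify them against the API reference before relying on this shape.
    """
    # Assumption: at least one node pool must carry the DEFAULT role.
    if not any("DEFAULT" in roles for roles in node_pools.values()):
        raise ValueError("at least one node pool needs the DEFAULT role")

    # Resource name of the existing GKE cluster that hosts the workloads.
    gke_target = f"projects/{project}/locations/{region}/clusters/{gke_cluster}"

    # One target per node pool, each tagged with the roles it should run.
    node_pool_targets = [
        {"nodePool": f"{gke_target}/nodePools/{pool}", "roles": list(roles)}
        for pool, roles in node_pools.items()
    ]

    return {
        "projectId": project,
        "clusterName": dataproc_cluster,
        "virtualClusterConfig": {
            "stagingBucket": staging_bucket,
            "kubernetesClusterConfig": {
                "gkeClusterConfig": {
                    "gkeClusterTarget": gke_target,
                    "nodePoolTarget": node_pool_targets,
                },
                "kubernetesSoftwareConfig": {
                    "componentVersion": {"SPARK": spark_version},
                },
            },
        },
    }


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    body = build_virtual_cluster(
        project="example-project",
        region="us-central1",
        gke_cluster="example-gke",
        dataproc_cluster="example-dataproc",
        node_pools={
            "dp-default": ["DEFAULT"],
            "dp-executors": ["SPARK_EXECUTOR"],
        },
        staging_bucket="example-staging-bucket",
        spark_version="PLACEHOLDER_SPARK_VERSION",
    )
    print(json.dumps(body, indent=2))
```

Note that the description contains no machine types, disk sizes, or network settings: those belong to the GKE cluster and its node pools, which is what makes the Dataproc cluster "virtual."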
Note that Dataproc on GKE currently focuses primarily on Spark workloads. For use cases involving Hive, HDFS, or other Hadoop components, you’ll need to opt for Dataproc on GCE. Table 5-1 compares Dataproc on GCE to Dataproc on GKE.
| Feature ... |
|---|