Chapter 7. Other Bottlenecks in Distributed Systems

Introduction

Whereas the previous few chapters have covered specific low-level hardware resources that can cause performance bottlenecks, distributed systems often have a number of components that work together, and any one of these components can become a bottleneck for the entire system. Also, in some cases, what is presented to a programmer or user as a single distributed system can in fact be one distributed system built on top of another—for example, HBase runs on HDFS—and so can suffer from bottlenecks in the underlying system that are not immediately visible and require additional monitoring.

This chapter focuses on performance and management challenges for Hadoop in particular, but many of the same concepts apply to other distributed systems—only the specific components differ.

NameNode Contention

HDFS uses an architecture with a single NameNode that stores the directory tree of all files in the distributed file system and tracks which nodes hold each file. The NameNode process stores information in memory about every HDFS file, including metadata for the file, the blocks the file is broken into, and the nodes that store the HDFS blocks. As a result, the NameNode on large clusters might store 300 bytes of data for each of many millions of HDFS files and blocks,1 and thus its memory usage can be large and grow over time. If the memory needed exceeds the available physical memory on the node, the NameNode can start swapping ...

Get Effective Multi-Tenant Distributed Systems now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.