Chapter 8. Planning, Deploying, and Monitoring Flume
Up to this point, we’ve discussed the architecture of Flume and the various components and their configuration. In this chapter, we will discuss how to plan a Flume deployment and how to deploy and monitor Flume agents. We will also discuss the various tools available outside of the Flume project itself that make deployment and monitoring of Flume easier.
Planning a Flume Deployment
Planning a Flume deployment can be tricky. This section walks through the steps involved in planning one that fits your requirements.
Time to Repair
Most production deployments define a mean time to repair (MTTR) for systems that have gone down, which is usually a good estimate of how long a failed system will take to come back online. In this section, we assume that the MTTR for the servers hosting the various services is known and that, in most cases, recovery does not take longer than this. The value will vary between deployments. If a maximum time to repair (the longest time a failed system takes to recover) is available, use that instead. For the rest of this chapter, we assume that this maximum is available, and we'll call it MaxTTR.
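To make the distinction between MTTR and MaxTTR concrete, here is a minimal sketch of how the two values might be derived from past outage records. The repair durations and the function names (mttr, max_ttr, planning_ttr) are hypothetical and purely illustrative; they are not part of Flume. The longest observed repair is also only an estimate of the true maximum.

```python
from statistics import mean

# Hypothetical repair durations, in minutes, for past outages of one server.
# These numbers are made up for illustration only.
repair_minutes = [12, 45, 30, 90, 18]


def mttr(durations):
    """Mean time to repair: the average of the observed repair durations."""
    return mean(durations)


def max_ttr(durations):
    """Maximum time to repair: the longest observed repair duration."""
    return max(durations)


def planning_ttr(mttr_value, max_ttr_value=None):
    """Pick the value to plan with: MaxTTR when it is known, otherwise MTTR."""
    return max_ttr_value if max_ttr_value is not None else mttr_value


if __name__ == "__main__":
    average = mttr(repair_minutes)
    longest = max_ttr(repair_minutes)
    print("MTTR (minutes):", average)
    print("MaxTTR (minutes):", longest)
    print("Value to plan with:", planning_ttr(average, longest))
```

Planning against the larger value is the more conservative choice, because it covers the slowest recovery seen so far rather than the typical one.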
Now that the user already has information on (or has calculated) the maximum time that each machine can go down for, we also assume for the purposes of this chapter that any planned or unplanned ...