
Apache Flume: Distributed Log Collection for Hadoop - Second Edition

Book Description

Design and implement a series of Flume agents to send streamed data into Hadoop

In Detail

Apache Flume is a distributed, reliable, and available service used to efficiently collect, aggregate, and move large amounts of log data. It is used to stream logs from application servers to HDFS for ad hoc analysis.

This book starts with an architectural overview of Flume and its logical components. It explores channels, sinks, and sink processors, followed by sources and channel selectors. By the end of this book, you will be fully equipped to construct a series of Flume agents that dynamically transport your streaming data and logs from your systems into Hadoop.

This step-by-step guide walks you through the architecture and components of Flume, covering different approaches that are then pulled together in a real-world, end-to-end use case, progressing gradually from the simplest to the most advanced features.
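The agent model described above, where a source writes events into a channel and a sink drains them, can be sketched with a minimal configuration of the kind the book develops. This is an illustrative fragment only; the agent name (`agent`), component names, and port number are placeholders:

```properties
# Name the components of this agent
agent.sources = netcat-source
agent.channels = memory-channel
agent.sinks = log-sink

# A netcat source listening on a local port, feeding the channel
agent.sources.netcat-source.type = netcat
agent.sources.netcat-source.bind = localhost
agent.sources.netcat-source.port = 44444
agent.sources.netcat-source.channels = memory-channel

# An in-memory channel buffering events between source and sink
agent.channels.memory-channel.type = memory

# A logger sink that prints events, useful for smoke testing a flow
agent.sinks.log-sink.type = logger
agent.sinks.log-sink.channel = memory-channel
```

A configuration like this is typically started with the `flume-ng agent` command, naming the agent and pointing at the properties file.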

What You Will Learn

  • Understand the Flume architecture and how to download and install open source Flume from Apache
  • Follow a detailed example of transporting weblogs in Near Real Time (NRT) to Kibana/Elasticsearch, with archival in HDFS
  • Learn tips and tricks for transporting logs and data in your production environment
  • Understand and configure the Hadoop Distributed File System (HDFS) Sink
  • Use a morphline-backed Sink to feed data into Solr
  • Create redundant data flows using sink groups
  • Configure and use various sources to ingest data
  • Inspect data records and move them between multiple destinations based on payload content
  • Transform data en-route to Hadoop and monitor your data flows
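As a taste of the HDFS Sink configuration covered above, the fragment below sketches a sink writing time-bucketed files into HDFS. It is a hedged example: the channel name, NameNode host, path, and roll thresholds are illustrative, and the `%Y/%m/%d` escapes require a `timestamp` header on each event (for example, via the Timestamp interceptor):

```properties
# HDFS sink draining a file channel into date-partitioned directories
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.channel = file-channel
agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/flume/weblogs/%Y/%m/%d
agent.sinks.hdfs-sink.hdfs.filePrefix = access
# Write raw events rather than SequenceFiles
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
# Roll files every 5 minutes or 128 MB, whichever comes first;
# 0 disables rolling by event count
agent.sinks.hdfs-sink.hdfs.rollInterval = 300
agent.sinks.hdfs-sink.hdfs.rollSize = 134217728
agent.sinks.hdfs-sink.hdfs.rollCount = 0
```

Tuning the roll settings against your file-size and latency needs is one of the production topics the book returns to.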

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

  1. Apache Flume: Distributed Log Collection for Hadoop Second Edition
    1. Table of Contents
    2. Apache Flume: Distributed Log Collection for Hadoop Second Edition
    3. Credits
    4. About the Author
    5. About the Reviewers
    6. www.PacktPub.com
      1. Support files, eBooks, discount offers, and more
        1. Why subscribe?
        2. Free access for Packt account holders
    7. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    8. 1. Overview and Architecture
      1. Flume 0.9
      2. Flume 1.X (Flume-NG)
      3. The problem with HDFS and streaming data/logs
      4. Sources, channels, and sinks
      5. Flume events
        1. Interceptors, channel selectors, and sink processors
        2. Tiered data collection (multiple flows and/or agents)
      6. The Kite SDK
      7. Summary
    9. 2. A Quick Start Guide to Flume
      1. Downloading Flume
        1. Flume in Hadoop distributions
      2. An overview of the Flume configuration file
      3. Starting up with "Hello, World!"
      4. Summary
    10. 3. Channels
      1. The memory channel
      2. The file channel
      3. Spillable Memory Channel
      4. Summary
    11. 4. Sinks and Sink Processors
      1. HDFS sink
        1. Path and filename
        2. File rotation
      2. Compression codecs
      3. Event Serializers
        1. Text output
        2. Text with headers
        3. Apache Avro
        4. User-provided Avro schema
        5. File type
          1. SequenceFile
          2. DataStream
          3. CompressedStream
        6. Timeouts and workers
      4. Sink groups
        1. Load balancing
        2. Failover
      5. MorphlineSolrSink
        1. Morphline configuration files
        2. Typical SolrSink configuration
        3. Sink configuration
      6. ElasticSearchSink
        1. LogStash Serializer
        2. Dynamic Serializer
      7. Summary
    12. 5. Sources and Channel Selectors
      1. The problem with using tail
      2. The Exec source
      3. Spooling Directory Source
      4. Syslog sources
        1. The syslog UDP source
        2. The syslog TCP source
        3. The multiport syslog TCP source
      5. JMS source
      6. Channel selectors
        1. Replicating
        2. Multiplexing
      7. Summary
    13. 6. Interceptors, ETL, and Routing
      1. Interceptors
        1. Timestamp
        2. Host
        3. Static
        4. Regular expression filtering
        5. Regular expression extractor
        6. Morphline interceptor
        7. Custom interceptors
          1. The plugins directory
      2. Tiering flows
        1. The Avro source/sink
          1. Compressing Avro
          2. SSL Avro flows
        2. The Thrift source/sink
        3. Using command-line Avro
        4. The Log4J appender
        5. The Log4J load-balancing appender
      3. The embedded agent
        1. Configuration and startup
        2. Sending data
        3. Shutdown
      4. Routing
      5. Summary
    14. 7. Putting It All Together
      1. Web logs to searchable UI
        1. Setting up the web server
          1. Configuring log rotation to the spool directory
        2. Setting up the target – Elasticsearch
        3. Setting up Flume on collector/relay
        4. Setting up Flume on the client
        5. Creating more search fields with an interceptor
        6. Setting up a better user interface – Kibana
      2. Archiving to HDFS
      3. Summary
    15. 8. Monitoring Flume
      1. Monitoring the agent process
        1. Monit
        2. Nagios
      2. Monitoring performance metrics
        1. Ganglia
        2. Internal HTTP server
        3. Custom monitoring hooks
      3. Summary
    16. 9. There Is No Spoon – the Realities of Real-time Distributed Data Collection
      1. Transport time versus log time
      2. Time zones are evil
      3. Capacity planning
      4. Considerations for multiple data centers
      5. Compliance and data expiry
      6. Summary
    17. Index