Learning Cascading

Book Description

Efficiently build reliable, robust, and high-performance big data applications using the Cascading application development platform

In Detail

Cascading is open source software for creating and executing complex data processing workflows on big data clusters. The book starts by explaining how Cascading relates to core big data technologies such as Hadoop MapReduce. With that foundation in place, it provides a comprehensive introduction to the Cascading paradigm and its components using code examples. You will learn the more advanced Cascading features and write code that uses them, and you will gain in-depth knowledge of how to optimize a Cascading application. To deepen your knowledge and experience with Cascading, you will work through a real-life case study that uses Natural Language Processing to perform text analysis and search on large volumes of unstructured text. Throughout the book, you will receive expert advice on how to use the portions of the product that are undocumented or have limited documentation. By the end of the book, you will be able to build practical Cascading applications.

What You Will Learn

  • Familiarize yourself with tuples, pipes, taps, and flows and build your first Cascading application
  • Discover how to design, develop, and use custom operations
  • Design, develop, use, and reuse code with subassemblies and Cascades
  • Acquire the skills you need to integrate Cascading with external systems
  • Gain expertise in testing, QA, and performance tuning to run an efficient and successful Cascading project
  • Explore project management methodologies and steps to develop workable solutions
  • Discover the future of big data frameworks and understand how Cascading can help your software to evolve with it
  • Uncover sources of additional information and other tools that can make development tasks a lot easier

Downloading the example code for this book

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

  1. Learning Cascading
    1. Table of Contents
    2. Learning Cascading
    3. Credits
    4. Foreword
    5. About the Authors
    6. About the Reviewers
    7. www.PacktPub.com
      1. Support files, eBooks, discount offers, and more
        1. Why subscribe?
        2. Free access for Packt account holders
    8. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Downloading the color images of this book
        3. Errata
        4. Piracy
        5. Questions
    9. 1. The Big Data Core Technology Stack
      1. Reviewing Hadoop
        1. Hadoop architecture
        2. HDFS – the Hadoop Distributed File System
          1. The NameNode
          2. The secondary NameNode
          3. DataNodes
      2. MapReduce execution framework
        1. The JobTracker
        2. The TaskTracker
        3. Hadoop jobs
        4. Distributed cache
        5. Counters
        6. YARN – MapReduce version 2
        7. A simple MapReduce job
        8. Beyond MapReduce
      3. The Cascading framework
        1. The execution graph and flow planner
        2. How Cascading produces MapReduce jobs
      4. Summary
    10. 2. Cascading Basics in Detail
      1. Understanding common Cascading themes
        1. Data flows as processes
      2. Understanding how Cascading represents records
        1. Using tuples and defining fields
          1. Using a Fields object, named field groups, and selectors
          2. Data typing and coercion
        2. Defining schemes
          1. Schemes in detail
          2. TupleEntry
      3. Understanding how Cascading controls data flow
        1. Using pipes
        2. Creating and chaining
        3. Pipe operations
          1. Each
          2. Splitting
          3. GroupBy and sorting
          4. Every
        4. Merging and joining
          1. The Merge pipe
          2. The join pipes – CoGroup and HashJoin
          3. CoGroup
          4. HashJoin
          5. Default output selectors
        5. Using taps
        6. Flow
        7. FlowConnector
        8. Cascades
        9. Local and Hadoop modes
          1. Common errors
      4. Putting it all together
      5. Summary
    11. 3. Understanding Custom Operations
      1. Understanding operations
        1. Operations and fields
        2. The Operation class and interface hierarchy
          1. The basic operation lifecycle
          2. Contexts
          3. FlowProcess
          4. OperationCall<Context>
          5. An operation processing sequence and its methods
        3. Operation types
          1. Each operations
            1. Filters
              1. Filter calling sequence
              2. Built-in filters
            2. Function
              1. Function calling sequence
              2. Built-in functions
          2. Every operations
            1. Aggregator
              1. Aggregator calling sequence
              2. Built-in aggregators
          3. Buffers
            1. Buffer calling sequence
            2. Built-in buffers
          4. Assertions
            1. ValueAssertion calling sequence
            2. GroupAssertion calling sequence
            3. AssertionLevel
            4. Using assertions
            5. Built-in assertions
            6. A note about implementing BaseOperation methods
      2. Summary
    12. 4. Creating Custom Operations
      1. Writing custom operations
        1. Writing a filter
        2. Writing a function
        3. Writing an aggregator
        4. Writing a custom assertion
        5. Writing a buffer
      2. Identifying common use cases for custom operations
        1. Putting it all together
      3. Summary
    13. 5. Code Reuse and Integration
      1. Creating and using subassemblies
        1. Built-in subassemblies
        2. Creating a new custom subassembly
        3. Using custom subassemblies
      2. Using cascades
        1. Building a complex workflow using cascades
        2. Skipping a flow in a cascade
          1. Intermediate file management
      3. Dynamically controlling flows
        1. Instrumentation and counters
          1. Using counters to control flow
          2. Using existing MapReduce jobs
            1. Using fluent programming techniques
          3. The FlowDef fluent interface
      4. Integrating external components
        1. Flow and cascade events
          1. Using external JAR files
        2. Using Cascading as insulation from big data migrations and upgrades
      5. Summary
    14. 6. Testing a Cascading Application
      1. Debugging a Cascading application
        1. Getting your environment ready for debugging
          1. Using Cascading local mode debugging
          2. Setting up Eclipse
        2. Remote debugging
        3. Using assertions
        4. The Debug() filter
          1. Managing exceptions with traps
        5. Checkpoints
        6. Managing bad data
        7. Viewing flow sequencing using DOT files
      2. Testing strategies
        1. Unit testing and JUnit
        2. Mocking
        3. Integration testing
        4. Load and performance testing
      3. Summary
    15. 7. Optimizing the Performance of a Cascading Application
      1. Optimizing performance
        1. Optimizing Cascading
        2. Optimizing Hadoop
          1. A note about the effective use of checkpoints
      2. Summary
    16. 8. Creating a Real-world Application in Cascading
      1. Project description – Business Intelligence case study on monitoring the competition
      2. Project scope – understanding requirements
        1. Understanding the project domain – text analytics and natural language processing (NLP)
        2. Conducting a simple named entity extraction
      3. Defining the project – the Cascading development methodology
        1. Project roles and responsibilities
        2. Conducting data analysis
        3. Performing functional decomposition
        4. Designing the process and components
          1. Creating and integrating the operations
          2. Creating and using subassemblies
      4. Building the workflow
        1. Building flows
          1. Managing the context
        2. Building the cascade
        3. Designing the test plan
          1. Performing a unit test
          2. Performing an integration test
          3. Performing a cluster test
          4. Performing a full load test
        4. Refining and adjusting
        5. Software packaging and delivery to the cluster
      5. Next steps
      6. Summary
    17. 9. Planning for Future Growth
      1. Finding online resources
      2. Using other Cascading tools
        1. Lingual
        2. Pattern
        3. Driven
        4. Fluid
        5. Load
        6. Multitool
        7. Support for other languages
        8. Hortonworks
      3. Custom taps
      4. Cascading serializers
      5. Java open source mock frameworks
      6. Summary
    18. A. Downloadable Software
      1. Contents
      2. Installing and using
    19. Index