HDInsight Essentials

Book Description

This book covers the fundamentals of HDInsight and shows you how to manage even the largest volumes of unstructured data to gain business insight.

  • Architect a Hadoop solution with a modular design for data collection, distributed processing, analysis, and reporting
  • Build a multi-node Hadoop cluster on Windows servers
  • Establish a Big Data solution using HDInsight with open source software, and provide useful Excel reports
  • Run Pig scripts and build simple charts using Interactive JS (Azure)

In Detail

We live in an era in which data is generated with every action, and much of it is unstructured: Twitter feeds, Facebook updates, photos, and digital sensor inputs. Current relational databases cannot handle the volume, velocity, and variety of this data. HDInsight gives you the ability to gain the full value of Big Data with a modern, cloud-based data platform that manages data of any size and type, whether structured or unstructured.

This hands-on guide shows you how to store and process Big Data of all types on Microsoft’s modern data platform, which provides simplicity, ease of management, and an open, enterprise-ready Hadoop service running in the cloud. You will learn how to analyze your Hadoop data with PowerPivot, Power View, Excel, and other Microsoft BI tools. Thanks to this integration with the Microsoft data platform, you will have a solid foundation for building your own HDInsight solution, both on premises and in the cloud.

First, we provide an overview of Hadoop and Microsoft's Big Data strategy, in which HDInsight plays a key role. We then show you how to set up your HDInsight cluster and take you through the four stages of collecting, processing, analyzing, and reporting. For each of these stages, you will see a practical example with working code.

You will then learn core Hadoop concepts such as HDFS and MapReduce, and get a closer look at how Microsoft’s HDInsight builds on the Hortonworks Data Platform, which is based on Apache Hadoop. You will be guided through Hadoop commands and programming with open source software such as Hive and Pig on HDInsight. Finally, you will learn to analyze and report using PowerPivot, Power View, Excel, and other Microsoft BI tools.

This guide provides step-by-step instructions on how to build a Big Data solution using HDInsight with open source software, produce useful Excel reports, and unlock the full value of HDInsight.

Table of Contents

  1. HDInsight Essentials
    1. Table of Contents
    2. HDInsight Essentials
    3. Credits
    4. About the Author
    5. About the Reviewers
    6. www.PacktPub.com
      1. Support files, eBooks, discount offers and more
        1. Why Subscribe?
        2. Free Access for Packt account holders
    7. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    8. 1. Hadoop and HDInsight in a Heartbeat
      1. Big Data – hype or real?
      2. Apache Hadoop concepts
        1. Core components
        2. Hadoop cluster layout
        3. The Hadoop ecosystem
          1. Data access
          2. Data processing
          3. The Hadoop data store
          4. Management and integration
        4. Hadoop distributions
        5. HDInsight distribution differentiator
        6. End-to-end solution using HDInsight
          1. Key phases of a Hadoop project
            1. Stage 1 – collect data
            2. Stage 2a – process your data (build MapReduce)
            3. Stage 2b – process your data (execute MapReduce)
            4. Stage 3 – analyze data using JavaScript and Pig
            5. Stage 4 – report data using JavaScript charts
      3. Summary
    9. 2. Deploying HDInsight on Premise
      1. HDInsight and Hadoop relationship
      2. Deployment options for on-premise
        1. Windows HDInsight server
        2. Hortonworks Data Platform (HDP for Windows)
        3. Supported platforms for on-premise install
      3. Single-node install
        1. Downloading the software
        2. Running the install wizard
        3. Validating the install
      4. Multinode planning and preparation
        1. Setting up the network
        2. Setting common time on all nodes
        3. Setting up remote scripting
        4. Configuring firewall ports
      5. Multinode installation
        1. Downloading the software
        2. Configuring the multinode install
        3. Running the installer
        4. Validating the install
      6. Managing HDInsight services
      7. Uninstalling HDInsight
      8. Summary
    10. 3. HDInsight Azure Cloud Service
      1. HDInsight Service on Azure
        1. Considerations for Azure HDInsight Service
      2. Provision your cluster
      3. HDInsight management dashboard
      4. Verify the cluster and run sample jobs
        1. Access HDFS
        2. Deploy and execute the sample MapReduce job
        3. View job results
      5. Monitor your cluster
      6. Azure storage integration
      7. Remove your cluster
        1. Delete your cluster
        2. Delete your storage
        3. Restore your cluster
      8. Summary
    11. 4. Administering Your HDInsight Cluster
      1. Cluster status
      2. Distributed filesystem health
        1. NameNode URL
        2. Browsing HDFS
      3. MapReduce health
        1. MapReduce summary
        2. MapReduce Job History
      4. Key files
        1. Backing up NameNode content
      5. Summary
    12. 5. Ingesting Data to Your Cluster
      1. Loading data using Hadoop commands
        1. Step 1 – connect to a Hadoop client
        2. Step 2 – get your files on local storage
        3. Step 3 – upload to HDFS
      2. Loading data using Azure Storage Vault (ASV)
        1. Storage access keys
        2. Storage tools
        3. Azure Storage Explorer
          1. Registering your storage account
          2. Uploading files to your blob storage
      3. Loading data using interactive JavaScript
      4. Shipping data to Azure
      5. Loading data using Sqoop
        1. Key benefits
        2. Two modes of using Sqoop
        3. Using Sqoop to import (SQL to Hadoop)
      6. Summary
    13. 6. Transforming Data in Cluster
      1. Transformation scenario
        1. Scenario
        2. Transformation objective
        3. File organization
      2. MapReduce solution
        1. Design
        2. Map code
        3. Reduce code
        4. Driver code
        5. Compiling and packaging the code
        6. Executing MapReduce
        7. Results verification
      3. Hive solution
        1. Overview of Hive
        2. Starting Hive in the HDInsight node
        3. Step 1 – table creation
        4. Step 2 – table loading
        5. Step 3 – summary table creation
        6. Step 4 – verifying the summary table
      4. Pig solution
        1. Pig architecture
        2. Pig or Hive?
        3. Starting Pig in the HDInsight node
        4. Pig Grunt script
          1. Code
          2. Code explanation
          3. Execution
          4. Verification
      5. Summary
    14. 7. Analyzing and Reporting Your Data
      1. Analyzing and reporting using Excel
        1. Step 1 – installing the Hive ODBC driver
        2. Step 2 – creating Hive ODBC data source
        3. Step 3 – importing data to Excel
      2. Hive for ad hoc queries
        1. Creating reference tables
        2. Ad hoc queries
        3. Analytic functions in HiveQL
      3. Interactive JavaScript for analysis and reporting
      4. Other business intelligence tools
      5. Summary
    15. 8. Project Planning Tips and Resources
      1. Architectural considerations
        1. Extensible and modular
        2. Metadata-driven solution
        3. Integration strategy
        4. Security
      2. Project planning
        1. Proof of Concept
        2. Production implementation
        3. Reference sites and blogs
      3. Summary
    16. Index