Serverless Analytics with Amazon Athena

Book description

Get more from your data with Amazon Athena’s ease of use, interactive performance, and pay-per-query pricing

Key Features

  • Explore the promising capabilities of Amazon Athena and Athena’s Query Federation SDK
  • Use Athena to prepare data for common machine learning activities
  • Cover best practices for connecting your application to Athena, along with key security considerations

Book Description

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using SQL, without needing to manage any infrastructure.

This book begins with an overview of the serverless analytics experience offered by Athena and teaches you how to build and tune an S3 data lake using Athena, including how to structure your tables using open-source file formats like Parquet. You’ll learn how to build, secure, and connect to a data lake with Athena and Lake Formation. Next, you’ll cover key tasks such as ad hoc data analysis, working with ETL pipelines, monitoring and alerting on KPI breaches using CloudWatch Metrics, running customizable connectors with AWS Lambda, and more. Moving on, you’ll work through easy integrations, troubleshooting and tuning for common Athena issues, and the most common reasons for query failure. You will also review tips to help diagnose and correct failing queries in your pursuit of operational excellence. Finally, you’ll explore advanced concepts such as Athena Query Federation and Athena ML to generate powerful insights without needing to touch a single server.

By the end of this book, you’ll be able to build and use a data lake with Amazon Athena to add data-driven features to your app and perform the kind of ad hoc data analysis that often precedes many of today’s ML modeling exercises.

What you will learn

  • Secure and manage the cost of querying your data
  • Use Athena ML and User-Defined Functions (UDFs) to add advanced features to your reports
  • Write your own Athena Connector to integrate with a custom data source
  • Discover your datasets on S3 using AWS Glue Crawlers
  • Integrate Amazon Athena into your applications
  • Set up Identity and Access Management (IAM) policies to limit access to tables and databases in Glue Data Catalog
  • Add an Amazon SageMaker Notebook to your Athena queries
  • Get to grips with using Athena for ETL pipelines

Who this book is for

Business intelligence (BI) analysts, application developers, and system administrators who are looking to generate insights from an ever-growing sea of data while controlling costs and limiting operational burden will find this book helpful. Basic SQL knowledge is expected to make the most of this book.

Table of contents

  1. Serverless Analytics with Amazon Athena
  2. Foreword
  3. Contributors
  4. About the authors
  5. About the reviewers
  6. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
    4. Download the example code files
    5. Download the color images
    6. Conventions used
    7. Get in touch
    8. Share Your Thoughts
  7. Section 1: Fundamentals Of Amazon Athena
  8. Chapter 1: Your First Query
    1. Technical requirements
    2. What is Amazon Athena?
      1. Use cases
      2. Separation of storage and compute
    3. Obtaining and preparing sample data
    4. Running your first query
      1. Creating your first table
      2. Running your first analytics queries
    5. Summary
  9. Chapter 2: Introduction to Amazon Athena
    1. Technical requirements
    2. Getting to know Amazon Athena
      1. Understanding the "serverless" trend
      2. Beyond "serverless" with 'fully managed' offerings
      3. Key features
    3. What is Presto?
    4. Understanding scale and latency
      1. TableScan performance
      2. Memory-bound operations
      3. Writing results
    5. Metering and billing
      1. Additional costs
      2. File formats affect cost and performance
      3. Cost controls
    6. Connecting and securing
    7. Determining when to use Amazon Athena
      1. Ad hoc analytics
      2. Adding analytics features to your application
      3. Serverless ETL pipeline
      4. Other use cases
    8. Summary
    9. Further reading
  10. Chapter 3: Key Features, Query Types, and Functions
    1. Technical requirements
    2. Running ETL queries
      1. Using CREATE-TABLE-AS-SELECT
      2. Using INSERT-INTO
    3. Running approximate queries
    4. Organizing workloads with WorkGroups and saved queries
    5. Using Athena's APIs
    6. Summary
  11. Section 2: Building and Connecting to Your Data Lake
  12. Chapter 4: Metastores, Data Sources, and Data Lakes
    1. Technical requirements
    2. What is a metastore?
      1. Data sources, connectors, and catalogs
      2. Databases and schemas
      3. Tables/datasets
    3. What is a data source?
      1. S3 data sources
      2. Other data sources
    4. Registering S3 datasets in your metastore
      1. Using Athena CREATE TABLE statements
      2. Using Athena's Create Table wizard
      3. Using the AWS Glue console
      4. Using AWS Glue Crawlers
    5. Discovering your datasets on S3 using AWS Glue Crawlers
      1. How do AWS Glue Crawlers work?
      2. AWS Glue Crawler best practices for Athena
    6. Designing a data lake architecture
      1. Stages of data
      2. Transforming data using Athena
    7. Summary
      1. Further reading
  13. Chapter 5: Securing Your Data
    1. Technical requirements
    2. General best practices to protect your data on AWS
      1. Separating permissions based on IAM users, roles, or even accounts
      2. Least privilege for IAM users, roles, and accounts
      3. Rotating IAM user credentials frequently
      4. Blocking public access on S3 buckets
      5. Enabling data and metadata encryption and enforcing it
      6. Ensuring that auditing is enabled
      7. Good intentions cannot replace good mechanisms
    3. Encrypting your data and metadata in Glue Data Catalog
      1. Encrypting your data
      2. Encrypting your metadata in Glue Data Catalog
    4. Enabling coarse-grained access controls with IAM resource policies for data on S3
    5. Enabling FGACs with Lake Formation for data on S3
    6. Auditing with CloudTrail and S3 access logs
      1. Auditing with AWS CloudTrail
      2. Auditing with S3 server access logs
    7. Summary
    8. Further reading
  14. Chapter 6: AWS Glue and AWS Lake Formation
    1. Technical requirements
      1. What AWS Glue and AWS Lake Formation can do for you
      2. Securing your data lake with Lake Formation
      3. What AWS Lake Formation governed tables can do for you
    2. Summary
    3. Further reading
  15. Section 3: Using Amazon Athena
  16. Chapter 7: Ad Hoc Analytics
    1. Technical requirements
    2. Understanding the ad hoc analytics hype
    3. Building an ad hoc analytics strategy
      1. Choosing your storage
      2. Sharing data
      3. Selecting query engines
      4. Deploying to customers
    4. Using QuickSight with Athena
      1. Getting sample data
      2. Setting up QuickSight
    5. Using Jupyter Notebooks with Athena
      1. pandas
      2. Matplotlib and Seaborn
      3. SciPy and NumPy
      4. Using our notebook to explore
    6. Summary
  17. Chapter 8: Querying Unstructured and Semi-Structured Data
    1. Technical requirements
    2. Why isn't all data structured to begin with?
    3. Querying JSON data
      1. Reading our customer's dataset
      2. Parsing JSON fields
      3. Other considerations when reading JSON
      4. Querying comma-separated value and tab-separated value data
    4. Querying arbitrary log data
      1. Doing full log scans on S3
      2. Reading application log data
    5. Summary
    6. Further reading
  18. Chapter 9: Serverless ETL Pipelines
    1. Technical requirements
    2. Understanding the uses of ETL
      1. ETL for integration
      2. ETL for aggregation
      3. ETL for modularization
      4. ETL for performance
    3. Deciding whether to ETL or query in place
    4. Designing ETL queries for Athena
      1. Don't forget about performance
      2. Begin with integration points
      3. Use an orchestrator
    5. Using Lambda as an orchestrator
      1. Creating an ETL function
      2. Coding the ETL function
      3. Testing your ETL function
    6. Triggering ETL queries with S3 notifications
    7. Summary
  19. Chapter 10: Building Applications with Amazon Athena
    1. Technical requirements
    2. Connecting to Athena
      1. JDBC and ODBC
      2. Which one should I use?
    3. Best practices for connecting to Athena
      1. Idempotency tokens
      2. Query tracking
    4. Securing your application
      1. Credential management
      2. Network safety
    5. Optimizing for performance and cost
      1. Workload isolation
      2. Application monitoring
      3. CTAS for large result sets
    6. Summary
  20. Chapter 11: Operational Excellence – Monitoring, Optimization, and Troubleshooting
    1. Technical requirements
      1. Monitoring Athena to ensure queries run smoothly
      2. Optimizing for cost and performance
      3. Troubleshooting failing queries
      4. Summary
      5. Further reading
  21. Section 4: Advanced Topics
  22. Chapter 12: Athena Query Federation
    1. Technical requirements
    2. What is Query Federation?
      1. Athena Query Federation features
    3. How Athena Connectors work
      1. Using Lambda for big data
      2. Federating queries across VPCs
    4. Using pre-built Connectors
    5. Building a custom connector
      1. Setting up your development environment
      2. Writing your connector code
    6. Summary
  23. Chapter 13: Athena UDFs and ML
    1. Technical requirements
    2. What are UDFs?
    3. Writing a new UDF
      1. Setting up your development environment
      2. Writing your UDF code
      3. Building your UDF code
      4. Deploying your UDF code
      5. Using your UDF
    4. Using built-in ML UDFs
      1. Pre-setup requirements
      2. Setting up your SageMaker notebook
      3. Using our notebook to train a model
      4. Using our trained model in an Athena UDF
    5. Summary
  24. Chapter 14: Lake Formation – Advanced Topics
    1. Reinforcing your data perimeter with Lake Formation
      1. Establishing a data perimeter
      2. Shared responsibility security model
      3. How Lake Formation can help
    2. Understanding the benefits of governed tables
      1. ACID transactions on S3-backed tables
    3. Summary
      1. Further reading
    4. Why subscribe?
  25. Other Books You May Enjoy
    1. Packt is searching for authors like you
    2. Share Your Thoughts

Product information

  • Title: Serverless Analytics with Amazon Athena
  • Author(s): Anthony Virtuoso, Mert Turkay Hocanin, Aaron Wishnick
  • Release date: November 2021
  • Publisher(s): Packt Publishing
  • ISBN: 9781800562349