Business Intelligence with Databricks SQL

Book description

Master critical skills needed to deploy and use Databricks SQL and elevate your BI from the warehouse to the lakehouse with confidence

Key Features

  • Learn about business intelligence on the lakehouse with features and functions of Databricks SQL
  • Make the most of Databricks SQL by getting to grips with the enablers of its data warehousing capabilities
  • A unique approach to teaching concepts and techniques with follow-along scenarios on real datasets

Book Description

In this new era of data platform system design, data lakes and data warehouses are giving way to the lakehouse – a new type of data platform system that aims to unify all data analytics into a single platform. Databricks, with its Databricks SQL product suite, is the hottest lakehouse platform out there, harnessing the power of Apache Spark™, Delta Lake, and other innovations to enable data warehousing capabilities on the lakehouse with data lake economics.

This book is a comprehensive hands-on guide that helps you explore all the advanced features, use cases, and technology components of Databricks SQL. You'll start with the lakehouse architecture fundamentals and understand how Databricks SQL fits into it. The book then shows you how to use the platform, from exploring data, executing queries, building reports, and using dashboards through to learning the administrative aspects of the lakehouse – data security, governance, and management of the computational power of the lakehouse. You'll also delve into the core technology enablers of Databricks SQL – Delta Lake and Photon. Finally, you'll get hands-on with advanced SQL commands for ingesting data and maintaining the lakehouse.

By the end of this book, you'll have mastered Databricks SQL and be able to deploy and deliver fast, scalable business intelligence on the lakehouse.

What you will learn

  • Understand how Databricks SQL fits into the Databricks Lakehouse Platform
  • Perform everyday analytics with Databricks SQL Workbench and business intelligence tools
  • Organize and catalog your data assets
  • Program the data security model to protect and govern your data
  • Tune SQL warehouses (computing clusters) for optimal query experience
  • Tune the Delta Lake storage format for maximum query performance
  • Deliver extreme performance with the Photon query execution engine
  • Implement advanced data ingestion patterns with Databricks SQL

Who this book is for

This book is for business intelligence practitioners, data warehouse administrators, and data engineers who are new to Databrick SQL and want to learn how to deliver high-quality insights unhindered by the scale of data or infrastructure. This book is also for anyone looking to study the advanced technologies that power Databricks SQL. Basic knowledge of data warehouses, SQL-based analytics, and ETL processes is recommended to effectively learn the concepts introduced in this book and appreciate the innovation behind the platform.

Table of contents

  1. Business Intelligence with Databricks SQL
  2. Contributors
  3. About the author
  4. About the reviewers
  5. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
    4. Download the example code files
    5. Download the color images
    6. Conventions used
    7. Get in touch
    8. Share Your Thoughts
  6. Part 1: Databricks SQL on the Lakehouse
  7. Chapter 1: Introduction to Databricks
    1. Technical requirements
    2. An overview of Databricks, the company
    3. An overview of the Lakehouse architecture
    4. An overview of the Databricks Lakehouse platform
    5. Summary
  8. Chapter 2: The Databricks Product Suite – A Visual Tour
    1. Technical requirements
    2. Basic navigation with the sidebar
      1. The top of the sidebar
      2. The bottom of the sidebar
      3. The middle of the sidebar
    3. The SQL persona view
    4. The Machine Learning persona view
    5. The Data Science and Engineering persona view
    6. Summary
  9. Chapter 3: The Data Catalog
    1. Technical requirements
    2. Understanding the data organization model in 
Databricks SQL
      1. Apache Hive Metastore
      2. Unity Catalog
      3. Implications of choosing a cataloging technology
      4. An example of the data organization model
    3. Exploring data visually with the Data Catalog
    4. Exploring the data programmatically with SQL statements
    5. Summary
  10. Chapter 4: The Security Model
    1. Technical requirements
    2. The Databricks SQL security model
      1. Access control with Apache Hive Metastore
      2. Access control with Unity Catalog
      3. Query execution model
    3. User-facing table access control
      1. Users, groups, and service principals
      2. Securable objects
      3. Operations
      4. Privileges
      5. Bringing everything together
      6. The security model in practice
      7. Ownership
      8. Sharing the database
      9. Exploring the database
      10. Exploring asset metadata
      11. Revoking access
      12. Denying access
      13. Going beyond read access – part 1
      14. Going beyond read access – part 2
      15. Going beyond read access – part 3
      16. Summarizing the security model
      17. UI-based user-facing table access control
    4. The internals of cloud storage access
      1. Cloud storage access in Microsoft Azure
      2. Cloud storage access in Amazon Web Services
    5. Summary
  11. Chapter 5: The Workbench
    1. Technical requirements
    2. Working with queries
      1. Developing queries
    3. Visualizing query results
    4. Creating and publishing dashboards
    5. Administering and governing artifacts
    6. Summary
  12. Chapter 6: The SQL Warehouses
    1. Technical requirements
    2. Understanding the SQL Warehouse architecture
    3. Creating and configuring SQL Warehouses
      1. Cluster size
      2. Scaling
      3. Spot instance policy
      4. Auto Stop
    4. The art of SQL Warehouse sizing
      1. Rules for query routing, queuing, and cluster autoscaling
      2. Sizing the SQL Warehouse
    5. Organizing and governing SQL Warehouses
      1. SQL Warehouse assignment strategy
      2. Access control in SQL Warehouses
      3. Chargeback
    6. Using Serverless SQL
    7. Summary
  13. Chapter 7: Using Business Intelligence Tools with Databricks SQL
    1. Technical requirements
    2. Connecting from validated BI tools
      1. SQL Warehouse details
      2. Authentication details
    3. Connecting from non-validated BI tools
      1. Step 1 – download the driver
      2. Step 2 – install the driver
      3. Step 3 – configure the SQL Warehouse connection
    4. Connecting programmatically
    5. Databricks Partner Connect
      1. The internals of a data source file
    6. Summary
  14. Part 2: Internals of Databricks SQL
  15. Chapter 8: The Delta Lake
    1. Technical requirements
    2. Fundamentals of the Delta Lake storage format
      1. Data engineering before Delta Lake
      2. The Delta Lake storage format
      3. Data engineering after Delta Lake
    3. Built-in performance-boosting features of Delta Lake
      1. Automatic statistics collection
      2. Automatic Compaction and Optimized Writes
      3. Automatic caching
    4. Configurable performance-boosting features of Delta Lake
      1. Z-ordering
      2. Bloom filter indexes
      3. CACHE SELECT
    5. Summary
  16. Chapter 9: The Photon Engine
    1. Technical requirements
    2. Understanding Photon Engine
      1. What is Photon?
      2. The Apache Spark execution model
    3. Understanding vectorization
      1. Volcano model
      2. Code generation
      3. Vectorization
    4. Discussing the Photon product roadmap
    5. Summary
    6. Further reading
  17. Chapter 10: Warehouse on the Lakehouse
    1. Technical requirements
    2. Organizing data on the Lakehouse
      1. Components of a warehouse system
      2. The Medallion architecture
    3. Implementing data modeling techniques
      1. The bronze layer
      2. The silver layer
      3. The gold layer
    4. Summary
  18. Part 3: Databricks SQL Commands
  19. Chapter 11: SQL Commands – Part 1
    1. Technical requirements
    2. Working with data definition language commands
      1. DDL for catalogs
      2. DDL for external locations
      3. DDL for Delta Sharing
    3. Working with data manipulation language commands
      1. MERGE INTO
      2. COPY INTO
    4. Working with the inbuilt functions in Databricks SQL
      1. JSON
      2. Lambda functions
    5. Summary
  20. Chapter 12: SQL Commands – Part 2
    1. Technical requirements
    2. Working with Delta Lake maintenance commands
      1. Vacuuming your Delta Lake
      2. Time -traveling in your Delta Lake
      3. Repairing your Delta Lake
      4. Optimizing your Delta Lake
    3. Working with data security commands
      1. Dynamic view functions
      2. Controlling access to columns
      3. Controlling access to rows
    4. Working with metadata commands
      1. Listing data assets
      2. Describing data assets
      3. Analyzing Delta tables
    5. Summary
  21. Part 4: TPC-DS, Experiments, and Frequently Asked Questions
  22. Chapter 13: Playing with the TPC-DS Dataset
    1. Technical requirements
    2. Understanding the TPC-DS dataset
    3. Generating TPC-DS data
      1. Building the spark-sql-perf library
      2. Installing the spark-sql-perf library
      3. Creating a data generation cluster
      4. Importing the spark-sql-perf repository
      5. Running the data generation notebook
    4. Running automated benchmarks
    5. Experimenting with TPC-DS in Databricks SQL
      1. Case study 1 – the effect of file formats
      2. Case study 2 – the effect of specialized data types
      3. Case study 3 – the effect of NULLs
      4. Case study 4 – ZORDER and partitions
      5. Case study 5 – Bloom filter indexes
    6. Summary
  23. Chapter 14: Ask Me Anything
    1. Frequently asked questions
    2. Summary
  24. Index
    1. Why subscribe?
  25. Other Books You May Enjoy
    1. Packt is searching for authors like you
    2. Share Your Thoughts

Product information

  • Title: Business Intelligence with Databricks SQL
  • Author(s): Vihag Gupta
  • Release date: September 2022
  • Publisher(s): Packt Publishing
  • ISBN: 9781803235332