Learning and Operating Presto

by Angelica Lo Duca, Tim Meehan, Vivek Bharathan, Ying Su

September 2023

Intermediate to advanced

191 pages

4h 32m

English

O'Reilly Media, Inc.

Why We Wrote This Book; Who This Book Is For; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments; Angelica Lo Duca; Tim Meehan; Vivek Bharathan; Ying Su
Data Warehouses and Data Lakes; The Role of Presto in a Data Lake; Presto Origins and Design Considerations; High Performance; High Scalability; Compliance with the ANSI SQL Standard; Federation of Data Sources; Running in the Cloud; Presto Architecture and Core Components; Alternatives to Presto; Apache Impala; Apache Hive; Spark SQL; Trino; Presto Use Cases; Reporting and Dashboarding; Ad Hoc Querying; ETL Using SQL; Data Lakehouse; Real-Time Analytics with Real-Time Databases; Introducing Our Case Study; Conclusion
Presto Manual Installation; Running Presto on Docker; Installing Docker; Presto Docker Image; Building and Running Presto on Docker; The Presto Sandbox; Deploying Presto on Kubernetes; Introducing Kubernetes; Configuring Presto on Kubernetes; Adding a New Catalog; Running the Deployment on Kubernetes; Querying Your Presto Instance; Listing Catalogs; Listing Schemas; Listing Tables; Querying a Table; Conclusion
Service Provider Interface; Connector Architecture; Popular Connectors; Thrift; Writing a Custom Connector; Prerequisites; Plugin and Module; Configuration; Metadata; Input/Output; Deploying Your Connector; Apache Pinot; Setting Up and Configuring Presto; Presto-Pinot Querying in Action; Conclusion
Setting Up the Environment; Presto Client; Docker Image; Kubernetes Node; Connectivity to Presto; REST API; Python; R; JDBC; Node.js; ODBC; Other Presto Client Libraries; Building a Client Dashboard in Python; Setting Up the Client; Building the Dashboard; Conclusion
The Emergence of the Lakehouse; Data Lakehouse Architecture; Data Lake; File Store; File Format; Table Format; Query Engine; Metadata Management; Data Governance; Data Access Control; Building a Data Lakehouse; Configuring MinIO; Configuring HMS; Configuring Spark; Registering Hudi Tables with HMS; Connecting and Querying Presto; Conclusion
Introducing Presto Administration; Configuration; Properties; Sessions; JVM; Monitoring; Console; REST API; Metrics; Management; Resource Groups; Verifiers; Session Properties Managers; Namespace Functions; Conclusion
Introducing Presto Security; Building Secure Communication in Presto; Encryption; Keystore Management; Configuring HTTPS/TLS; Authentication; File-Based Authentication; LDAP; Kerberos; Creating a Custom Authenticator; Authorization; Authorizing Access to the Presto REST API; Configuring System Access Control; Authorization Through Apache Ranger; Conclusion
Introducing Performance Tuning; Reasons for Performance Tuning; The Performance Tuning Life Cycle; Query Execution Model; Approaches for Performance Tuning in Presto; Resource Allocation; Storage; Query Optimization; Aria Scan; Table Scanning; Repartitioning; Implementing Performance Tuning; Building and Importing the Sample CSV Table in MinIO; Converting the CSV Table in ORC; Defining the Tuning Parameters; Running Tests; Conclusion
Introducing Scalability; Reasons to Scale Presto; Common Issues; Design Considerations; Availability; Manageability; Performance; Protection; Configuration; How to Scale Presto; Multiple Coordinators; Presto on Spark; Spilling; Using a Cloud Service; Conclusion

Content preview from Learning and Operating Presto

Chapter 9. Operating Presto at Scale

Scalability is the ability of a Presto cluster to handle increased demand or usage with minimal impact on performance, so that the system’s response time remains consistent and acceptable even as the workload grows.

We won’t implement a specific scenario in this chapter, because how you scale a Presto cluster depends on your cluster’s workload. For the same reason, you won’t find code for this chapter in the book’s GitHub repository. Instead, we’ll discuss general strategies for scaling a Presto cluster that you can adapt to your specific conditions.

The chapter is organized into four parts. In the first part, we’ll introduce some basic concepts related to scalability, including reasons to scale a Presto cluster and some common issues that arise when a Presto cluster needs to be scaled. In the second part, we’ll look at the design considerations to keep in mind when you want to scale your Presto cluster: availability, manageability, performance, protection, and configuration. In the third part, we’ll analyze popular approaches for scaling a Presto cluster, including multiple coordinators, Presto on Spark, and spilling. Finally, we’ll focus on how to scale a Presto cluster using a cloud service.

Introducing Scalability

Operating Presto at scale means adding more resources to the system to handle an increased workload. The concept of scalability is slightly different from that of performance tuning, which you learned about in Chapter 8. In fact, performance tuning ...