
Learning and Operating Presto

by Angelica Lo Duca, Tim Meehan, Vivek Bharathan, Ying Su

September 2023

Intermediate to advanced

191 pages

4h 32m

English

O'Reilly Media, Inc.


Why We Wrote This Book; Who This Book Is For; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments; Angelica Lo Duca; Tim Meehan; Vivek Bharathan; Ying Su
Data Warehouses and Data Lakes; The Role of Presto in a Data Lake; Presto Origins and Design Considerations; High Performance; High Scalability; Compliance with the ANSI SQL Standard; Federation of Data Sources; Running in the Cloud; Presto Architecture and Core Components; Alternatives to Presto; Apache Impala; Apache Hive; Spark SQL; Trino; Presto Use Cases; Reporting and Dashboarding; Ad Hoc Querying; ETL Using SQL; Data Lakehouse; Real-Time Analytics with Real-Time Databases; Introducing Our Case Study; Conclusion
Presto Manual Installation; Running Presto on Docker; Installing Docker; Presto Docker Image; Building and Running Presto on Docker; The Presto Sandbox; Deploying Presto on Kubernetes; Introducing Kubernetes; Configuring Presto on Kubernetes; Adding a New Catalog; Running the Deployment on Kubernetes; Querying Your Presto Instance; Listing Catalogs; Listing Schemas; Listing Tables; Querying a Table; Conclusion
Service Provider Interface; Connector Architecture; Popular Connectors; Thrift; Writing a Custom Connector; Prerequisites; Plugin and Module; Configuration; Metadata; Input/Output; Deploying Your Connector; Apache Pinot; Setting Up and Configuring Presto; Presto-Pinot Querying in Action; Conclusion
Setting Up the Environment; Presto Client; Docker Image; Kubernetes Node; Connectivity to Presto; REST API; Python; R; JDBC; Node.js; ODBC; Other Presto Client Libraries; Building a Client Dashboard in Python; Setting Up the Client; Building the Dashboard; Conclusion
The Emergence of the Lakehouse; Data Lakehouse Architecture; Data Lake; File Store; File Format; Table Format; Query Engine; Metadata Management; Data Governance; Data Access Control; Building a Data Lakehouse; Configuring MinIO; Configuring HMS; Configuring Spark; Registering Hudi Tables with HMS; Connecting and Querying Presto; Conclusion
Introducing Presto Administration; Configuration; Properties; Sessions; JVM; Monitoring; Console; REST API; Metrics; Management; Resource Groups; Verifiers; Session Properties Managers; Namespace Functions; Conclusion
Introducing Presto Security; Building Secure Communication in Presto; Encryption; Keystore Management; Configuring HTTPS/TLS; Authentication; File-Based Authentication; LDAP; Kerberos; Creating a Custom Authenticator; Authorization; Authorizing Access to the Presto REST API; Configuring System Access Control; Authorization Through Apache Ranger; Conclusion
Introducing Performance Tuning; Reasons for Performance Tuning; The Performance Tuning Life Cycle; Query Execution Model; Approaches for Performance Tuning in Presto; Resource Allocation; Storage; Query Optimization; Aria Scan; Table Scanning; Repartitioning; Implementing Performance Tuning; Building and Importing the Sample CSV Table in MinIO; Converting the CSV Table in ORC; Defining the Tuning Parameters; Running Tests; Conclusion
Introducing Scalability; Reasons to Scale Presto; Common Issues; Design Considerations; Availability; Manageability; Performance; Protection; Configuration; How to Scale Presto; Multiple Coordinators; Presto on Spark; Spilling; Using a Cloud Service; Conclusion

Content preview from Learning and Operating Presto

Preface

Data warehousing began by pulling data from operational databases into systems that were better optimized for analytics. These systems were appliances that were expensive to operate, which meant people were highly judicious about what data they ingested into their data warehousing appliance for analytics.

Over the years, demand for more data has exploded, far outpacing Moore’s law and challenging legacy data warehousing appliances. While this trend is true for the industry at large, certain companies were earlier than others to encounter the scaling challenges this posed.

In 2012, Facebook was among the earliest companies to attempt to solve this problem. At the time, Facebook was using Apache Hive to perform interactive analysis. As Facebook’s datasets grew, Hive proved too slow to be truly interactive. This was largely because Hive was built on MapReduce, which, at the time, required intermediate datasets to be persisted to disk, generating heavy disk I/O for transient, intermediate result sets. So Facebook developed Presto, a new distributed SQL query engine designed as an in-memory engine that does not need to persist intermediate result sets for a single query. This approach led to a query engine that processed the same query orders of magnitude faster, with many queries completing with subsecond latency. End users such as engineers, product managers, and data analysts found they could interactively query fractions of large datasets ...