Chapter 7. Data Ingestion and Streaming Tools
Data ingestion tools link operational applications with the analytical tools that produce reports, and they supply organized data to machine learning models. Ingestion tools also strongly shape an organization's data processing capabilities: data that cannot be ingested accurately, reliably, and in a timely manner loses much of its usefulness. In this chapter, we cover some of the better-known contemporary data ingestion and streaming tools.
By understanding the strengths and trade-offs of these tools, organizations can design robust and scalable ingestion pipelines that align with their strategic objectives. And, as data continues to grow in volume, velocity, and variety, selecting the right tools and frameworks will remain crucial for maintaining competitive advantage.
Apache Beam, Flink, Spark, and Storm
Apache Beam is an open source, unified programming model for defining both batch and streaming data processing pipelines. It allows developers to build pipelines that can run on a variety of execution engines (or runners), such as Apache Flink and Apache Spark (which we’ll go over shortly). Beam abstracts away the complexities of parallel computing and simplifies the development of data-intensive applications. It supports key features such as windowing, event-time processing, and a rich set of built-in transforms, making it flexible enough for both near-real-time and batch workloads.
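To make the windowing and event-time ideas concrete, here is a minimal, library-free sketch of tumbling (fixed-size, non-overlapping) windows, the simplest windowing strategy Beam supports. The function name and event format are hypothetical illustrations, not Beam API; in a real Beam pipeline you would instead apply a windowing transform such as `beam.WindowInto(beam.window.FixedWindows(60))`. The point is that events are grouped by their *event time* (when they occurred), not by when the pipeline happens to process them.

```python
from collections import defaultdict

def tumbling_windows(events, window_size_s):
    """Group (event_time_epoch_s, value) pairs into fixed, non-overlapping
    windows, keyed by each window's start timestamp.

    This is an illustrative sketch of the concept, not Beam's API.
    """
    windows = defaultdict(list)
    for event_time, value in events:
        # Each event lands in the window whose start is the largest
        # multiple of window_size_s that is <= event_time.
        window_start = event_time - (event_time % window_size_s)
        windows[window_start].append(value)
    return dict(windows)

events = [
    (0, "a"),    # falls in window [0, 60)
    (30, "b"),   # falls in window [0, 60)
    (65, "c"),   # falls in window [60, 120)
]
print(tumbling_windows(events, 60))
# {0: ['a', 'b'], 60: ['c']}
```

Beam generalizes this idea with sliding and session windows, plus watermarks and triggers that decide when a window's results can be emitted even though late events may still arrive.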
Beam can read your data from a diverse set ...