Scalable Data Architecture with Java

Name: Scalable Data Architecture with Java
Author: Sinchan Banerjee
ISBN: 9781801073080

September 2022

Beginner to intermediate

382 pages

9h 35m

English

Packt Publishing

Scalable Data Architecture with Java
Contributors
  About the author
  About the reviewers
Preface
  Who this book is for
  What this book covers
  To get the most out of this book
  Download the example code files
  Download the color images
  Conventions used
  Get in touch
  Share Your Thoughts
Section 1 – Foundation of Data Systems
Chapter 1: Basics of Modern Data Architecture
  Exploring the landscape of data engineering
  What is data engineering?
  Dimensions of data
  Types of data engineering problems
  Responsibilities and challenges of a Java data architect
  Data architect versus data engineer
  Challenges of a data architect
  Techniques to mitigate those challenges
  Summary
Chapter 2: Data Storage and Databases
  Understanding data types, formats, and encodings
  Data types
  Data formats
  Understanding file, block, and object storage
  File storage
  Block storage
  Object storage
  The data lake, data warehouse, and data mart
  Data lake
  Data warehouse
  Data marts
  Databases and their types
  Relational database
  NoSQL database
  Data model design considerations
  Summary
Chapter 3: Identifying the Right Data Platform
  Technical requirements
  Virtualization and containerization platforms
  Benefits of virtualization
  Containerization
  Benefits of containerization
  Kubernetes
  Hadoop platforms
  Hadoop architecture
  Cloud platforms
  Benefits of cloud computing
  Choosing the correct platform
  When to choose virtualization versus containerization
  When to use big data
  Choosing between on-premise versus cloud-based solutions
  Choosing between various cloud vendors
  Summary
Section 2 – Building Data Processing Pipelines
Chapter 4: ETL Data Load – A Batch-Based Solution to Ingesting Data in a Data Warehouse
  Technical requirements
  Understanding the problem and source data
  Problem statement
  Understanding the source data
  Building an effective data model
  Relational data warehouse schemas
  Evaluation of the schema design
  Designing the solution
  Implementing and unit testing the solution
  Summary
Chapter 5: Architecting a Batch Processing Pipeline
  Technical requirements
  Developing the architecture and choosing the right tools
  Problem statement
  Analyzing the problem
  Architecting the solution
  Factors that affect your choice of storage
  Determining storage based on cost
  The cost factor in the processing layer
  Implementing the solution
  Profiling the source data
  Writing the Spark application
  Deploying and running the Spark application
  Developing and testing a Lambda trigger
  Performance tuning a Spark job
  Querying the ODL using AWS Athena
  Summary
Chapter 6: Architecting a Real-Time Processing Pipeline
  Technical requirements
  Understanding and analyzing the streaming problem
  Problem statement
  Analyzing the problem
  Architecting the solution
  Implementing and verifying the design
  Setting up Apache Kafka on your local machine
  Developing the Kafka streaming application
  Unit testing a Kafka Streams application
  Configuring and running the application
  Creating a MongoDB Atlas cloud instance and database
  Configuring Kafka Connect to store the results in MongoDB
  Verifying the solution
  Summary

Chapter 7: Core Architectural Design Patterns
  Core batch processing patterns
  The staged Collect-Process-Store pattern
  Common file format processing pattern
  The Extract-Load-Transform pattern
  The compaction pattern
  The staged report generation pattern
  Core stream processing patterns
  The outbox pattern
  The saga pattern
  The choreography pattern
  The Command Query Responsibility Segregation (CQRS) pattern
  The strangler fig pattern
  The log stream analytics pattern
  Hybrid data processing patterns
  The Lambda architecture
  The Kappa architecture
  Serverless patterns for data ingestion
  Summary
Chapter 8: Enabling Data Security and Governance
  Technical requirements
  Introducing data governance – what and why
  When to consider data governance
  The DGI data governance framework
  Practical data governance using DataHub and NiFi
  Creating the NiFi pipeline
  Setting up DataHub
  Governance activities
  Understanding the need for data security
  Solution and tools available for data security
  Summary
Section 3 – Enabling Data as a Service
Chapter 9: Exposing MongoDB Data as a Service
  Technical requirements
  Introducing DaaS – what and why
  Benefits of using DaaS
  Creating a DaaS to expose data using Spring Boot
  Problem statement
  Analyzing and designing a solution
  Implementing the Spring Boot REST application
  Deploying the application in an ECS cluster
  API management
  Enabling API management over the DaaS API using AWS API Gateway
  Summary
Chapter 10: Federated and Scalable DaaS with GraphQL
  Technical requirements
  Introducing GraphQL – what, when, and why
  Operation types
  Why use GraphQL?
  When to use GraphQL
  Core architectural patterns of GraphQL
  A practical use case – exposing federated data models using GraphQL
  Summary
Section 4 – Choosing Suitable Data Architecture
Chapter 11: Measuring Performance and Benchmarking Your Applications
  Performance engineering and planning
  Performance engineering versus performance testing
  Tools for performance engineering
  Publishing performance benchmarks
  Optimizing performance
  Java Virtual Machine and garbage collection optimizations
  Big data performance tuning
  Optimizing streaming applications
  Database tuning
  Summary
Chapter 12: Evaluating, Recommending, and Presenting Your Solutions
  Creating cost and resource estimations
  Storage and compute capacity planning
  Effort and timeline estimation
  Creating an architectural decision matrix
  Data-driven architectural decisions to mitigate risk
  Presenting the solution and recommendations
  Summary
Index
Why subscribe?
Other Books You May Enjoy
  Packt is searching for authors like you
  Share Your Thoughts

Content preview from Scalable Data Architecture with Java

Chapter 6: Architecting a Real-Time Processing Pipeline

In the previous chapter, we learned how to architect a big data solution for a high-volume, batch-based data engineering problem. Then, we learned how big data can be profiled using Glue DataBrew. Finally, we learned how to choose logically between various technologies to build a complete Spark-based big data solution in the cloud.

In this chapter, we will discuss how to analyze, design, and implement a real-time data analytics solution to solve a business problem. We will learn how distributed messaging systems such as Apache Kafka can be used to stream and process data reliably and quickly. Here, we will discuss how to write a Kafka Streams application ...
