Foundations for Architecting Data Solutions

by Ted Malaska, Jonathan Seidman

September 2018

Beginner to intermediate

187 pages

4h 59m

English

O'Reilly Media, Inc.

Preface
Who This Book Is For; Navigating This Book; Conventions Used in This Book; Using Code Examples; O’Reilly Safari; How to Contact Us; Acknowledgments
1. Key Data Project Types and Considerations
Major Data Project Types; Data Pipelines and Data Staging; Primary Considerations and Risk Management; Pipeline and Staging Team Makeup; Data Processing and Analysis; Primary Considerations and Risk Management; Data Processing and Analytics Team Makeup; Application Development; Primary Considerations and Risk Management; Application Development Team Makeup; Summary
2. Evaluating and Selecting Data Management Solutions
Stages of Open Source Projects; Private Incubation Stage; Release Stage; “Curing Cancer” Stage; Broken Promises Stage; Hardening Stage; Enterprise Stage; Decline and Slow Death Stage; Common Life Cycles for Open Source Projects; Open Sourcing a Dead Product; The Follower; Evaluating Benchmarks; Considerations for Technology Selection; Understanding the Building Blocks; Looking to a Guide for Advice; Using Analysts; Looking to Market Trends; Summary
3. Managing Risk in Data Projects
Categories of Risk; Technology Risk; Team Risk; Requirements Risk; Managing Risk; Categorizing Risk in Your Architecture; Technology Risk; Strength of the Team; Other Teams; Requirements Risk; Tying This All Together; Using Prototypes and Proofs of Concept; Build Two to Three Ways; Build PoCs and Then Throw Them Away; Deployment Considerations; Using Interfaces; Start Building Early; Test Often and Keep Records; Monitoring and Alerting; Communicating Risk; Collaborate and Gain Buy-In; Share the Risk; Using Risk as a Negotiation Tool; Summary
4. Interface Design
The Human Body; The Human Body Versus a Data Architecture; Decoupling; Decoupling Considerations; Specialization; What Makes a Good Interface Design; The Contract; The Abstraction; Versioning; Being Defensive; Documentation and Naming for Interfaces; Nonfunctional Considerations; Availability; Response-Time Guarantees; Load Capacity; Using Testing to Determine SLAs; Common Interface Examples; Publish–Subscribe; Request–Response Asynchronous Example; Request–Response Synchronous Example; Summary
5. Distributed Storage Systems
Attributes of Distributed Storage Systems; Storage System Genealogy; Partitioning; Mutation Options; Read Paths; Availability Versus Consistency; Primary Use Cases; Storage System Breakdown; HDFS; S3 and Object Stores; Apache HBase; Apache Cassandra; Elasticsearch and Apache Solr; Newcomers: Apache Kudu and CockroachDB; In-Memory Storage Systems; Summary
6. The Meta of Enterprise Data
Reasons to Care About Metadata; Visibility; Relationships; Regulation; Types of Metadata in a Data Architecture; Data at Rest; Data in Motion; Metadata for Source Data; Metadata About Data Processing; Reports and Dashboards; Metadata Collection; Declarative Metadata Collection; Discovery of Metadata; Metadata Management in Practice; Summary
7. Ensuring Data Integrity
Examples of Building Data Pipelines to Ensure Data Integrity; Predefined Data Pipelines; Validation of Data Pipelines; Row Counts; Distinct Count; Full-Byte Comparison; Checksum Comparison; Summary
8. Data Processing
Attributes of Processing Engines; DAG Management; Compute Isolation; Performance; Fault Tolerance; Interaction Model; Batch and/or Streaming; Data Processing over Time; Summary
Index

Overview

While many companies ponder implementation details such as distributed processing engines and algorithms for data analysis, this practical book takes a much wider view of big data development, starting with initial planning and moving diligently toward execution. Authors Ted Malaska and Jonathan Seidman guide you through the major components necessary to start, architect, and develop successful big data projects.

Everyone from CIOs and COOs to lead architects and developers will explore a variety of big data architectures and applications, from massive data pipelines to web-scale applications. Each chapter addresses a piece of the software development life cycle and identifies patterns to maximize long-term success throughout the life of your project.

Start the planning process by considering the key data project types
Use guidelines to evaluate and select data management solutions
Reduce risk related to technology, your team, and vague requirements
Explore system interface design using APIs, REST, and pub/sub systems
Choose the right distributed storage system for your big data system
Plan and implement metadata collections for your data architecture
Use data pipelines to ensure data integrity from source to final storage
Evaluate the attributes of various engines for processing the data you collect


Publisher Resources

ISBN: 9781492038733
