Foundations for Architecting Data Solutions

by Ted Malaska, Jonathan Seidman

September 2018

Beginner to intermediate

187 pages

4h 59m

English

O'Reilly Media, Inc.

Preface
Who This Book Is For; Navigating This Book; Conventions Used in This Book; Using Code Examples; O’Reilly Safari; How to Contact Us; Acknowledgments
1. Key Data Project Types and Considerations
Major Data Project Types; Data Pipelines and Data Staging; Primary Considerations and Risk Management; Pipeline and Staging Team Makeup; Data Processing and Analysis; Primary Considerations and Risk Management; Data Processing and Analytics Team Makeup; Application Development; Primary Considerations and Risk Management; Application Development Team Makeup; Summary
2. Evaluating and Selecting Data Management Solutions
Stages of Open Source Projects; Private Incubation Stage; Release Stage; “Curing Cancer” Stage; Broken Promises Stage; Hardening Stage; Enterprise Stage; Decline and Slow Death Stage; Common Life Cycles for Open Source Projects; Open Sourcing a Dead Product; The Follower; Evaluating Benchmarks; Considerations for Technology Selection; Understanding the Building Blocks; Looking to a Guide for Advice; Using Analysts; Looking to Market Trends; Summary
3. Managing Risk in Data Projects
Categories of Risk; Technology Risk; Team Risk; Requirements Risk; Managing Risk; Categorizing Risk in Your Architecture; Technology Risk; Strength of the Team; Other Teams; Requirements Risk; Tying This All Together; Using Prototypes and Proofs of Concept; Build Two to Three Ways; Build PoCs and Then Throw Them Away; Deployment Considerations; Using Interfaces; Start Building Early; Test Often and Keep Records; Monitoring and Alerting; Communicating Risk; Collaborate and Gain Buy-In; Share the Risk; Using Risk as a Negotiation Tool; Summary
4. Interface Design
The Human Body; The Human Body Versus a Data Architecture; Decoupling; Decoupling Considerations; Specialization; What Makes a Good Interface Design; The Contract; The Abstraction; Versioning; Being Defensive; Documentation and Naming for Interfaces; Nonfunctional Considerations; Availability; Response-Time Guarantees; Load Capacity; Using Testing to Determine SLAs; Common Interface Examples; Publish–Subscribe; Request–Response Asynchronous Example; Request–Response Synchronous Example; Summary
5. Distributed Storage Systems
Attributes of Distributed Storage Systems; Storage System Genealogy; Partitioning; Mutation Options; Read Paths; Availability Versus Consistency; Primary Use Cases; Storage System Breakdown; HDFS; S3 and Object Stores; Apache HBase; Apache Cassandra; Elasticsearch and Apache Solr; Newcomers: Apache Kudu and CockroachDB; In-Memory Storage Systems; Summary
6. The Meta of Enterprise Data
Reasons to Care About Metadata; Visibility; Relationships; Regulation; Types of Metadata in a Data Architecture; Data at Rest; Data in Motion; Metadata for Source Data; Metadata About Data Processing; Reports and Dashboards; Metadata Collection; Declarative Metadata Collection; Discovery of Metadata; Metadata Management in Practice; Summary
7. Ensuring Data Integrity
Examples of Building Data Pipelines to Ensure Data Integrity; Predefined Data Pipelines; Validation of Data Pipelines; Row Counts; Distinct Count; Full-Byte Comparison; Checksum Comparison; Summary
8. Data Processing
Attributes of Processing Engines; DAG Management; Compute Isolation; Performance; Fault Tolerance; Interaction Model; Batch and/or Streaming; Data Processing over Time; Summary
Index

Content preview from Foundations for Architecting Data Solutions

Preface

If you’re reading this book, you already know that there have been dramatic shifts in the data management landscape in recent years. We’ve seen a shift from third-party, proprietary solutions to new, open source distributed data systems. Of course, the common term used to refer to these newer solutions is “big data” (a term we find to be less and less useful), but it’s important to note that many of the earlier proprietary systems utilize distributed architectures that can store and process large volumes of data. Although we can apply these proprietary solutions and the newer open source solutions to solve many of the same problems, there are some distinct differences that have contributed to the growth of the newer systems. These include not just the economies of the open source approach, but also technology approaches that facilitate the implementation of many applications that are challenging with previous solutions.

Along with the growth of these systems, we’ve seen a corresponding growth in books, articles, training, conferences, and so on dedicated to helping you, the practitioner, use these systems, so it’s reasonable to ask: why yet another book on this “big data” stuff? To quote a cliché, we think the answer is that it becomes easy to miss the forest for the trees. Most of these materials focus on low-level details such as implementing applications using distributed processing engines like MapReduce or Spark or applying advanced algorithms to perform data analysis. ...

Publisher Resources

ISBN: 9781492038733
