Data Management at Scale

by Piethein Strengholt

July 2020

Intermediate to advanced

345 pages

10h 47m

English

O'Reilly Media, Inc.

Who Is This Book For?; What Will I Learn?; Navigating Through This Book; Conventions Used in This Book; O’Reilly Online Learning; How to Contact Us; Acknowledgments
Data Management; Analytics Is Fragmenting the Data Landscape; Speed of Software Delivery Is Changing; Networks Are Getting Faster; Privacy and Security Concerns Are a Top Priority; Operational and Transactional Systems Need to Be Integrated; Data Monetization Requires an Ecosystem-to-Ecosystem Architecture; Enterprises Are Saddled with Outdated Data Architectures; Enterprise Data Warehouse and Business Intelligence; Data Lake; Centralized View; Summary
Universally Acknowledged Starting Points; Each Application Has an Application Database; Applications Are Specific and Have Unique Context; Golden Source; There’s No Escape from the Data Integration Dilemma; Applications Play the Roles of Data Providers and Data Consumers; Key Theoretical Considerations; Object-Oriented Programming Principles; Domain-Driven Design; Business Architecture; Communication and Integration Patterns; Point-to-Point; Silos; Hub-Spoke Model; Scaled Architecture; Golden Sources and Domain Data Stores; Data Delivery Contracts and Data Sharing Agreements; Eliminating the Siloed Approach; Domain-Driven Design on an Enterprise Scale; Read-Optimized Data; Data Layer as a Holistic Picture; Metadata and the Target Operating Model; Summary
Introducing the RDS Architecture; Command and Query Responsibility Segregation; What Is CQRS?; CQRS at Scale; Read-Only Data Store Components and Services; Metadata; Data Quality; RDS Tiers; Data Ingestion; Integrating Commercial Off-the-Shelf Solutions; Extracting Data from External APIs and SaaSs; Historical Data Service; Design Variations; Data Replication; Access Layer; File Manipulation Service; Delivery Notification Service; De-Identification Service; Distributed Orchestration; Intelligent Consumption Services; Populating RDSs on Demand; RDS Direct Usage Considerations; Summary
Introducing the API Architecture; What Is Service-Oriented Architecture?; Enterprise Application Integration; Service Orchestration; Service Choreography; Public Services and Private Services; Service Models and Canonical Data Models; Similarities Between SOA and Enterprise Data Warehousing Architecture; Modern View on SOA; API Gateway; Responsibility Model; The New Role of the ESB; Service Contracts; Service Discovery; Microservices; The Role of the API Gateway Within Microservices; Functions; Service Mesh; Microservices Boundaries; Microservices Within the API Reference Architecture; Ecosystem Communication; API-Based Communication Channels; GraphQL; Backend for Frontend; Metadata; Using RDSs for Real-Time and Intensive Reads; Summary
Introducing the Streaming Architecture; The Asynchronous Event Model Makes the Difference; What Do Event-Driven Architectures Look Like?; Mediator Topology; Broker Topology; Event Processing Styles; A Gentle Introduction to Apache Kafka; Distributed Event Data; Apache Kafka Features; The Streaming Architecture; Event Producers; Event Consumers; Event Platform; Event Sourcing and Command Sourcing; Governance Model; Business Streams; Streaming Consumption Patterns; Event-Carried State Transfer; Playing the Role of an RDS; Using Streaming to Populate RDSs; Controls and Policies for Guiding the Domains; Streaming as the Operational Backbone; Guarantees and Consistency; Consistency Level; “At Least Once, Exactly Once, and at Most Once” Processing; Message Order; Dead Letter Queue; Streaming Interoperability; Metadata for Governance and Self-Service Models; Summary
Recap of the Architectures; RDS Architecture; API Architecture; Streaming Architecture; Strengthening Patterns; Enterprise Interoperability Standards; Stable Data Endpoints; Data Delivery Contracts; Accessible and Addressable Data; Crossing Network Principles; Enterprise Data Standards; Consumption-Optimization Principles; Discoverability of Metadata; Semantic Consistency; Supplying the Corresponding Metadata; Data Origination and Movements; Reference Architecture; Summary
Data Governance; Organization: Data Governance Roles; Processes: Data Governance Activities; People: Trust and Ethical, Social, and Economic Considerations; Technology: Golden Source, Ownership, and Application Administration; Data: Golden Sources, Golden Datasets, and Classifications; Data Security; Current Siloed Approach; Unified Data Security for Architectures; Identity Providers; Security Reference Architecture and Data Context Approach; Security Process Flow; Practical Guidance; RDS Architecture; API Architecture; Streaming Architecture; Intelligent Learning Engine; Summary
Consumption Patterns; Using Read-Only Data Stores Directly; Domain Data Stores; Target Operating Model; Data Professionals as a Target User Group; Business Requirements; Nonfunctional Requirements; Building the Data Pipeline and Data Model; Distributing Integrated Data; Business Intelligence Capabilities; Self-Service Capabilities; Analytical Capabilities; Standard Infrastructure for Automated Deployments; Stateless Models; Prescripted and Configured Workbenches; Standardize on Model Integration Patterns; Automation; Model Metadata; Advanced Analytics Reference Architecture; Summary

Demystifying Master Data Management; Master Data Management Styles; MDM Reference Architecture; Designing a Master Data Management Solution; MDM Distribution; Master Identification Numbers; Reference Data Versus Master Data; Determining the Scope of Your Enterprise Data; MDM and Data Quality as a Service; Curated Data; Metadata Exchange; Integrated Views; Reusable Components and Integration Logic; Data Republishing; Relation to Data Governance; Summary
Metadata Management; Enterprise Metadata Model; Enterprise Knowledge Graph; Architectural Approaches for Metadata Management; Metadata Interoperability; Metadata Repositories; Marketplace to Provide Rapid Access to Authorized Data; Summary
Delivery Model; Fully Decentralized Approach; Partially Decentralized Approach; Structuring Teams; InnerSource Strategy; Culture; Technology Choices; The Decline of Traditional Enterprise Architecture; Blueprints and Diagrams; Modern Skills; Control and Governance; Last Words

Content preview from Data Management at Scale

Glossary

Apache Avro

Apache Avro is an open source project that provides data serialization and data exchange services for the Apache Hadoop and Apache Kafka ecosystems. Programs can use it to serialize data into files or messages efficiently. It relies on a schema-based system (schema repository): the schema, which is defined in JSON, is always provided with the data. Avro is widely used within the Hadoop ecosystem and with Apache Kafka.
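To make the "data definition in JSON" point concrete, here is a minimal sketch of what an Avro record schema looks like. The record name, namespace, and fields are hypothetical and chosen only for illustration; the snippet uses Python's standard json module to show that the schema is plain JSON, and does not perform any Avro serialization itself.

```python
import json

# Hypothetical record schema, written in Avro's JSON schema syntax.
# The name, namespace, and fields are illustrative assumptions.
CUSTOMER_SCHEMA_JSON = """
{
  "type": "record",
  "name": "Customer",
  "namespace": "example.glossary",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
"""

# Because the schema is ordinary JSON, any JSON parser can read it.
schema = json.loads(CUSTOMER_SCHEMA_JSON)
field_names = [field["name"] for field in schema["fields"]]
print(schema["name"], field_names)
```

An Avro library would take a schema like this and use it to encode and decode records; the union type on the email field is how Avro expresses an optional value.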

Apache Thrift

Apache Thrift is an open source project that was developed at Facebook in 2007. It supports a wide range of languages and offers a full client/server stack that many projects can work with directly. It also uses an IDL (interface definition language) for describing the data types, which is quite similar to JSON and easily readable by humans.

Access tokens

Rather than using a username and password, an access token is used to represent the identity of the user or the user’s groups. It can contain additional attributes that describe the context in which the token can be used or the time window in which the token is valid.
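As a rough illustration of this definition, the sketch below models a token as a small immutable record holding an identity, group membership, permitted scopes, and a validity window. The class name, field names, and scope strings are all hypothetical; real token formats add signing and other protections that are left out here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class AccessToken:
    """Hypothetical token: identity plus the context in which it may be used."""
    subject: str                 # the user the token represents
    groups: Tuple[str, ...]      # the user's groups
    scopes: FrozenSet[str]       # what the token may be used for
    not_before: datetime         # start of the validity window
    expires_at: datetime         # end of the validity window (exclusive)

    def permits(self, scope: str, at: datetime) -> bool:
        # The token is usable only inside its time window and for a granted scope.
        in_window = self.not_before <= at < self.expires_at
        return in_window and scope in self.scopes


issued = datetime(2020, 7, 1, 12, 0, tzinfo=timezone.utc)
token = AccessToken(
    subject="alice",
    groups=("analysts",),
    scopes=frozenset({"orders:read"}),
    not_before=issued,
    expires_at=issued + timedelta(minutes=15),
)

print(token.permits("orders:read", issued + timedelta(minutes=5)))
print(token.permits("orders:read", issued + timedelta(minutes=20)))
```

The service receiving the token never sees a password; it only checks the attributes the token carries.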

Accuracy

The degree to which the data reflect the truth or reality. A spelling mistake is a good example of inaccurate data.

ACID

ACID stands for atomicity, consistency, isolation, durability.

Atomicity ensures that a transaction is either fully completed or not begun at all. Consistency enforces that the system is in a valid state at the ...
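The atomicity property can be sketched with Python's built-in sqlite3 module and an in-memory database. The accounts table, the names, and the transfer helper are hypothetical and exist only for this illustration: a transfer that would overdraw an account fails on its second statement, and the first statement is rolled back with it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, source, target, amount):
    """Move money between accounts as one all-or-nothing transaction."""
    try:
        # The connection context manager commits on success and
        # rolls back if an exception is raised inside the block.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


# The debit would make alice's balance negative, so the CHECK constraint
# rejects it and the earlier credit to bob is undone as well.
print(transfer(conn, "alice", "bob", 500))
print(balances(conn))
```

The CHECK constraint also hints at consistency: the database refuses to end a transaction in a state that violates its declared rules.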