Foundations for Architecting Data Solutions

Book Description

While many companies ponder implementation details such as distributed processing engines and algorithms for data analysis, this practical book takes a much wider view of big data development, starting with initial planning and moving diligently toward execution. Authors Ted Malaska and Jonathan Seidman guide you through the major components necessary to start, architect, and develop successful big data projects.

Everyone from CIOs and COOs to lead architects and developers will explore a variety of big data architectures and applications, from massive data pipelines to web-scale applications. Each chapter addresses a piece of the software development life cycle and identifies patterns to maximize long-term success throughout the life of your project.

  • Start the planning process by considering the key data project types
  • Use guidelines to evaluate and select data management solutions
  • Reduce risk related to technology, your team, and vague requirements
  • Explore system interface design using APIs, REST, and pub/sub systems
  • Choose the right distributed storage system for your big data system
  • Plan and implement metadata collections for your data architecture
  • Use data pipelines to ensure data integrity from source to final storage
  • Evaluate the attributes of various engines for processing the data you collect

Table of Contents

  1. Preface
    1. Who This Book Is For
    2. Navigating This Book
    3. Conventions Used in This Book
    4. Using Code Examples
    5. O’Reilly Safari
    6. How to Contact Us
    7. Acknowledgments
  2. 1. Key Data Project Types and Considerations
    1. Major Data Project Types
    2. Data Pipelines and Data Staging
      1. Primary Considerations and Risk Management
      2. Pipeline and Staging Team Makeup
    3. Data Processing and Analysis
      1. Primary Considerations and Risk Management
      2. Data Processing and Analytics Team Makeup
    4. Application Development
      1. Primary Considerations and Risk Management
      2. Application Development Team Makeup
    5. Summary
  3. 2. Evaluating and Selecting Data Management Solutions
    1. Stages of Open Source Projects
      1. Private Incubation Stage
      2. Release Stage
      3. “Curing Cancer” Stage
      4. Broken Promises Stage
      5. Hardening Stage
      6. Enterprise Stage
      7. Decline and Slow Death Stage
    2. Common Life Cycles for Open Source Projects
      1. Open Sourcing a Dead Product
      2. The Follower
    3. Evaluating Benchmarks
    4. Considerations for Technology Selection
      1. Understanding the Building Blocks
      2. Looking to a Guide for Advice
      3. Using Analysts
      4. Looking to Market Trends
    5. Summary
  4. 3. Managing Risk in Data Projects
    1. Categories of Risk
      1. Technology Risk
      2. Team Risk
      3. Requirements Risk
    2. Managing Risk
      1. Categorizing Risk in Your Architecture
      2. Technology Risk
      3. Strength of the Team
      4. Other Teams
      5. Requirements Risk
      6. Tying This All Together
    3. Using Prototypes and Proofs of Concept
      1. Build Two to Three Ways
      2. Build PoCs and Then Throw Them Away
      3. Deployment Considerations
    4. Using Interfaces
    5. Start Building Early
    6. Test Often and Keep Records
    7. Monitoring and Alerting
    8. Communicating Risk
      1. Collaborate and Gain Buy-In
      2. Share the Risk
    9. Using Risk as a Negotiation Tool
    10. Summary
  5. 4. Interface Design
    1. The Human Body
      1. The Human Body Versus a Data Architecture
      2. Decoupling
      3. Decoupling Considerations
      4. Specialization
    2. What Makes a Good Interface Design
      1. The Contract
      2. The Abstraction
      3. Versioning
      4. Being Defensive
      5. Documentation and Naming for Interfaces
    3. Nonfunctional Considerations
      1. Availability
      2. Response-Time Guarantees
      3. Load Capacity
      4. Using Testing to Determine SLAs
    4. Common Interface Examples
      1. Publish–Subscribe
      2. Request–Response Asynchronous Example
      3. Request–Response Synchronous Example
    5. Summary
  6. 5. Distributed Storage Systems
    1. Attributes of Distributed Storage Systems
      1. Storage System Genealogy
      2. Partitioning
      3. Mutation Options
      4. Read Paths
      5. Availability Versus Consistency
      6. Primary Use Cases
    2. Storage System Breakdown
      1. HDFS
      2. S3 and Object Stores
      3. Apache HBase
      4. Apache Cassandra
      5. Elasticsearch and Apache Solr
      6. Newcomers: Apache Kudu and CockroachDB
      7. In-Memory Storage Systems
    3. Summary
  7. 6. The Meta of Enterprise Data
    1. Reasons to Care About Metadata
      1. Visibility
      2. Relationships
      3. Regulation
    2. Types of Metadata in a Data Architecture
      1. Data at Rest
      2. Data in Motion
      3. Metadata for Source Data
      4. Metadata About Data Processing
      5. Reports and Dashboards
    3. Metadata Collection
      1. Declarative Metadata Collection
      2. Discovery of Metadata
    4. Metadata Management in Practice
    5. Summary
  8. 7. Ensuring Data Integrity
    1. Examples of Building Data Pipelines to Ensure Data Integrity
      1. Predefined Data Pipelines
    2. Validation of Data Pipelines
      1. Row Counts
      2. Distinct Count
      3. Full-Byte Comparison
      4. Checksum Comparison
    3. Summary
  9. 8. Data Processing
    1. Attributes of Processing Engines
      1. DAG Management
      2. Compute Isolation
      3. Performance
      4. Fault Tolerance
      5. Interaction Model
      6. Batch and/or Streaming
    2. Data Processing over Time
    3. Summary
  10. Index