Fundamentals of Data Engineering

by Joe Reis, Matt Housley
Released September 2022
Publisher(s): O'Reilly Media, Inc.
ISBN: 9781098108304

Book description

Data engineering has grown rapidly in the past decade, leaving many software engineers, data scientists, and analysts looking for a comprehensive view of this practice. With this practical book, you will learn how to plan and build systems to serve the needs of your organization and customers by evaluating the best technologies available in the framework of the data engineering lifecycle.

Authors Joe Reis and Matt Housley walk you through the data engineering lifecycle and show you how to stitch together a variety of cloud technologies to serve the needs of downstream data consumers. You will understand how to apply the concepts of data generation, ingestion, orchestration, transformation, storage, governance, and deployment that are critical in any data environment regardless of the underlying technology.

This book will help you:

  • Assess data engineering problems using an end-to-end data framework of best practices
  • Cut through marketing hype when choosing data technologies, architecture, and processes
  • Use the data engineering lifecycle to design and build a robust architecture
  • Incorporate data governance and security across the data engineering lifecycle

Table of contents

  1. Data Engineering Described
    1. What is data engineering?
      1. Data Engineering Defined
      2. The Data Engineering Lifecycle
      3. Evolution of the data engineer
      4. Data Engineering and Data Science
    2. Data Engineering Skills and Activities
      1. Data Maturity and the Data Engineer
      2. The background and skills of a data engineer
      3. Business Responsibilities
      4. Technical Responsibilities
      5. The Continuum of Data Engineering Roles, from A to B
    3. Data Engineers Inside an Organization
      1. Internal vs. External Facing Data Engineers
      2. Data Engineers and Other Technical Roles
      3. Data Engineers and Business Leadership
    4. Summary
    5. Further reading
    6. Links
  2. The Data Engineering Lifecycle
    1. What is the Data Engineering Lifecycle?
      1. The Data Lifecycle vs. The Data Engineering Lifecycle
      2. Generation - Source Systems
      3. Storage
      4. Ingestion
      5. Transformation
      6. Serving Data
    2. The major undercurrents across the data engineering lifecycle
      1. Security
      2. Data management
      3. Orchestration
      4. DataOps
      5. Data Architecture
      6. Software Engineering
    3. Summary
    4. Further Reading
      1. Data transformation and processing
      2. Undercurrents
    5. Further watching
  3. Designing Good Data Architecture
    1. What is Data Architecture?
      1. Enterprise Architecture, Defined
      2. Data Architecture Defined
      3. “Good” data architecture
    2. Major Architecture Concepts
      1. Domains and Services
      2. Distributed Systems, Scalability, and Designing for Failure
      3. Tight vs. Loose Coupling - Tiers, Monoliths, and Microservices
      4. User access - Single vs. Multi-Tenant
      5. Event-Driven Architecture
      6. Brownfield vs. Greenfield Projects
    3. Examples & Types of Data Architecture
      1. Data warehouse
      2. Data Lake
      3. Convergence, next-generation data lakes, and the data platform
      4. Data Mesh
      5. Modern Data Stack
      6. Lambda Architecture
      7. Kappa Architecture
      8. The Dataflow Model and Unified Batch and Streaming
      9. Architecture for IoT
      10. Other data architecture examples
    4. Who’s Involved with Designing a Data Architecture?
    5. Summary
    6. Further reading
  4. Choosing technologies across the Data Engineering Lifecycle
    1. Team size and capabilities
    2. Speed to market
    3. Interoperability
    4. Cost optimization and business value
      1. Total Cost of Ownership
      2. FinOps
      3. Total Opportunity Cost of Ownership
    5. Today vs. the future - Immutable vs. Transitory Technologies
      1. Our advice
    6. Location: On-Prem, Cloud, Hybrid, Multi-Cloud, and more
      1. On-premises
      2. Cloud
      3. Hybrid cloud
      4. Multi-cloud
      5. Decentralized - Blockchain and The Edge
      6. Our advice
      7. A discussion of cloud repatriation arguments
    7. Build vs. Buy
      1. Open-source software (OSS)
      2. Proprietary walled gardens
      3. Our advice
    8. Monolith vs. Modular
      1. Monolith
      2. Modularity
      3. The distributed monolith pattern
      4. Our advice
    9. Serverless vs. Infrastructure
      1. Serverless
      2. Containers
      3. When Infrastructure Makes Sense
      4. Our advice
    10. Undercurrents and how they impact choosing technologies
      1. Data Management
      2. DataOps
      3. Data Architecture
      4. Orchestration in Airflow
      5. Software Engineering
    11. Optimization, performance, and the benchmark wars
      1. Big data… for the ‘90s
      2. Nonsensical cost comparisons
      3. Asymmetric optimization
      4. Caveat emptor
    12. Summary
  5. Ingestion
    1. What Is Data Ingestion?
    2. Key Engineering Considerations for the Ingestion Phase
      1. Bounded Versus Unbounded
      2. Frequency
      3. Synchronous Versus Asynchronous Ingestion
      4. Serialization and Deserialization
      5. Throughput and Scalability
      6. Reliability and Durability
      7. Payload
      8. Push Versus Pull Patterns
    3. Batch Ingestion Patterns
      1. Snapshot or Differential Extraction
      2. File-Based Export and Ingestion
      3. ETL Versus ELT
      4. Inserts, Updates, and Batch Size
      5. Data Migration
    4. Streaming Ingestion Patterns
      1. Types of Time
      2. Key Ideas
      3. Streaming Change Data Capture
      4. Real-time and Micro-batch: Considerations for Downstream Destinations
    5. Ingestion Technologies
      1. Batch Ingestion Technologies
      2. Streaming Ingestion Technologies
    6. Who You’ll Work With
      1. Upstream Data Producers
      2. Downstream Data Consumers
    7. Undercurrents
      1. Security
      2. Data Management
      3. DataOps
      4. Orchestration
      5. Software Engineering
    8. Conclusion
