Architecting Data-Intensive Applications

Book description

Architect and design data-intensive applications and, in the process, learn how to collect, process, store, govern, and expose data for a variety of use cases

Key Features

  • Integrate the data-intensive approach into your application architecture
  • Create a robust application layout with effective messaging and data querying architecture
  • Enable smooth data flow and keep your data-intensive application fast and responsive

Book Description

Are you an architect or a developer who looks at your own applications warily while browsing through Facebook, silently applauding its data-intensive yet fluent and efficient behaviour? This book is your gateway to building smart data-intensive systems by incorporating core data-intensive architectural principles, patterns, and techniques directly into your application architecture.

This book starts by taking you through the primary design challenges involved in architecting data-intensive applications. You will learn how to implement data curation and data dissemination, depending on the volume of your data. You will then implement your application architecture one step at a time. You will get to grips with implementing the correct message delivery protocols and creating a data layer that doesn't fail under high traffic. This book will show you how to divide your application into layers, each of which adheres to the single responsibility principle. By the end of this book, you will be able to streamline your thinking and make the right choices about technologies and architectural principles based on the problem at hand.

What you will learn

  • Understand how to envision a data-intensive system
  • Identify and compare the non-functional requirements of a data collection component
  • Understand patterns involving data processing, as well as technologies that help to speed up the development of data processing systems
  • Understand how to implement data governance policies at design time using various open source tools
  • Recognize the anti-patterns to avoid while designing a data store for applications
  • Understand the different data dissemination technologies available to query the data in an efficient manner
  • Implement a simple data governance policy that can be extended using Apache Falcon

Who this book is for

This book is for developers and data architects who have to code, test, deploy, and/or maintain large-scale, high data volume applications. It is also useful for system architects who need to understand various non-functional aspects revolving around data-intensive systems.

Table of contents

  1. Title Page
  2. Copyright and Credits
    1. Architecting Data-Intensive Applications
  3. Packt Upsell
    1. Why subscribe?
    2. PacktPub.com
  4. Contributors
    1. About the author
    2. About the reviewer
    3. Packt is searching for authors like you
  5. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
    4. Get in touch
      1. Reviews
  6. Exploring the Data Ecosystem
    1. What is a data ecosystem?
      1. A complex set of interconnected data
      2. Data environment
    2. What constitutes a data ecosystem?
      1. Data sharing
        1. Traffic light protocol
    3. Information exchange policy
      1. Handling policy statements
      2. Action policy statements
      3. Sharing policy statements
      4. Licensing policy statements
      5. Metadata policy statements
    4. The 3 Vs
      1. Volume
      2. Variety
      3. Velocity
    5. Use cases
      1. Use case 1 – Security
      2. Use case 2 – Modem data collection
    6. Summary
  7. Defining a Reference Architecture for Data-Intensive Systems
    1. What is a reference architecture?
      1. Problem statement
    2. Reference architecture for a data-intensive system
      1. Component view
      2. Data ingest
      3. Data preparation
      4. Data processing
      5. Workflow management
      6. Data access
      7. Data insight
      8. Data governance
      9. Data pipeline
    3. Oracle's information management conceptual reference architecture
      1. Conceptual view
    4. Oracle's information management reference architecture
      1. Data process view
      2. Reference architecture – business view
      3. Real-life use case examples
        1. Machine learning use case 
        2. Data enrichment use case
        3. Extract transform load use case
    5. Desired properties of a data-intensive system
    6. Defining architectural principles
      1. Principle 1
      2. Principle 2
      3. Principle 3
      4. Principle 4
      5. Principle 5
      6. Principle 6
      7. Principle 7
    7. Listing architectural assumptions
    8. Architectural capabilities
      1. UI capabilities
        1. Content mashup
        2. Multi-channel support
        3. User workflow
        4. AR/VR support
      2. Service gateway/API gateway capabilities
        1. Security
        2. Traffic control
        3. Mediation
        4. Caching
        5. Routing
        6. Service orchestration
      3. Business service capabilities
        1. Microservices
        2. Messaging
        3. Distributed (batch/stream) processing
      4. Data capabilities
        1. Data partitioning
        2. Data replication
    9. Summary
  8. Patterns of the Data Intensive Architecture
    1. Application styles
    2. API Platform
      1. Message-oriented application style
      2. Micro Services application styles
    3. Communication styles
    4. Combining different application styles
    5. Architectural patterns
      1. The retry pattern
      2. The circuit breaker
      3. Throttling
      4. Bulkheads
      5. Event-sourcing
      6. Command and Query Responsibility Segregation
    6. Summary
  9. Discussing Data-Centric Architectures
    1. Coordination service
    2. Reliable messaging
    3. Distributed processing
    4. Distributed storage
    5. Lambda architecture
    6. Kappa architecture
      1. A brief comparison of different leading NoSQL data stores
    7. Summary
  10. Understanding Data Collection and Normalization Requirements and Techniques
    1. Data lineage
      1. Apache Atlas
        1. Apache Atlas high-level architecture
      2. Apache Falcon
    2. Data quality
    3. Types of data sources
    4. Data collection system requirements
    5. Data collection system architecture principles
      1. High-level component architecture
      2. High-level architecture
        1. Service gateway
        2. Discovery server
      3. Architecture technology mapping
    6. An introduction to ETCD
      1. Scheduler
      2. Designing the Micro Service
    7. Summary
  11. Creating a Data Pipeline for Consistent Data Collection, Processing, and Dissemination
    1. Query-Data pipelines
    2. Event-Data Pipelines
      1. Topology 1
      2. Topology 2
      3. Topology 3
      4. Resilience
      5. High-availability
        1. Availability Chart
      6. Clustering
        1. Clustering and Network Partitions
        2. Mirrored queues
        3. Persistent Messages
        4. Data Manipulation and Security
          1. Use Case 1
          2. Use Case 2
        5. Exchanges
          1. Guidelines on choosing the right Exchange Type
        6. Headers versus Topic Exchanges
        7. Routing
          1. Header-Based Content Routing
          2. Topic-Based Content Routing
    3. Alternate Exchanges
    4. Dead-Letter Exchanges
    5. Summary
  12. Building a Robust and Fault-Tolerant Data Collection System
    1. Apache Flume
      1. Flume event flow reliability
      2. Flume multi-agent flow
      3. Flow multiplexer
    2. Apache Sqoop
    3. ELK
      1. Beats
      2. Load-balancing
      3. Logstash
      4. Back pressure
      5. High-availability
    4. Centralized collection of distributed data
      1. Apache NiFi
    5. Summary
  13. Challenges of Data Processing
    1. Making sense of the data
      1. What is data processing?
    2. The 3 + 1 Vs and how they affect choice in data processing design
      1. Cost associated with latency
      2. Classic way of doing things
    3. Sharing resources among processing applications
      1. How to perform the processing
        1. Where to perform the processing
        2. Quality of data
        3. Networks are everywhere
        4. Effective consumption of the data
    4. Summary
  14. Let Us Process Data in Batches
    1. What do we mean by batch processing
    2. Lambda architecture and batch processing
    3. Batch layer components and subcomponents
      1. Read/extract component
      2. Normalizer component
      3. Validation component
      4. Processing component
      5. Writer/formatter component
      6. Basic shell component
      7. Scheduler/executor component
    4. Processing strategy
      1. Data partitioning
      2. Range-based partitioning
      3. Hash-based partitioning
    5. Distributed processing
    6. What are Hadoop and HDFS
      1. NameNode
      2. DataNode
      3. MapReduce
    7. Data pipeline
      1. Luigi
      2. Azkaban
      3. Oozie
      4. AirFlow
    8. Summary
  15. Handling Streams of Data
    1. What is a streaming system?
    2. Capabilities (and non-capabilities) of a streaming application
      1. Lambda architecture's speed layer
        1. Computing real-time views
    3. High-level reference architecture
    4. Samza architecture
      1. Architectural concepts
      2. Event-streaming layer
      3. Apache Kafka as an event bus
        1. Message persistence
        2. Persistent Queue Design
        3. Message batch
        4. Kafka and the sendfile operation
        5. Compression
    5. Kafka streams
      1. Stream processing topology
      2. Notion of time in stream processing
    6. Samza's stream processing API
    7. The scheduler/executor component of the streaming architecture
    8. Processing concepts and tradeoffs
      1. Processing guarantees
    9. Micro-batch stream processing
    10. Windowing
      1. Types of windows
    11. Summary
    12. References
  16. Let Us Store the Data
    1. The data explosion problem
    2. Relational Database Management Systems and Big data
    3. Introducing Hadoop, the Big Elephant
      1. Apache YARN
      2. Hadoop Distributed Filesystem
        1. HDFS architecture principles (and assumptions)
      3. High-level architecture of HDFS
    4. HDFS file formats
    5. HBase
      1. Understanding the basics of HBase
      2. HBase data model
      3. HBase architecture
      4. Horizontal scaling with automatic sharding of HBase tables
      5. HMaster, region assignment, and balancing
      6. Components of Apache HBase architecture
      7. Tips for improved performance from your HBase cluster
    6. Graph stores
      1. Background of the use case
      2. Scenario
      3. Solution discussion
      4. Bank fraud data model (as can be designed in a property graph data store such as Neo4J)
    7. Semantic graph
      1. Linked data
      2. Vocabularies
      3. Semantic Query Language
      4. Inference
    8. Stardog
      1. GraphQL queries
      2. Gremlin
      3. Virtual Graphs – a Unifying DAO
      4. Structured data
      5. CSV
    9. BITES – Unstructured/Semistructured document store
      1. Structured data extraction
      2. Text extraction
      3. Document queries
      4. Highly-available clusters
      5. Guarantees
      6. Scaling up
      7. Integration with SPARQL
      8. Data Formats
    10. Data integrity and validating constraints
      1. Strict parsing of RDF
      2. Integrity Constraint Validation
    11. Monitoring and operation
    12. Performance
    13. Summary
    14. Further reading
  17. When Data Dissemination is as Important as Data Itself
    1. Data dissemination
      1. Communication protocol
      2. Target audience
      3. Use case
      4. Response schema
      5. Communication channel
      6. Data dissemination architecture in a threat intel sharing system
      7. Threat intel share – backend
        1. RT query processor
        2. View builder
      8. Threat intel share – frontend
      9. AWS Lambda
      10. AWS API gateway
      11. Cache population
      12. Cache eviction
    2. Discussing the non-functional aspects of the preceding architecture
      1. Non-functional use cases for dissemination architecture
      2. Elasticsearch and free-text search queries
    3. Summary
  18. Other Books You May Enjoy
    1. Leave a review - let other readers know what you think

Product information

  • Title: Architecting Data-Intensive Applications
  • Author(s): Anuj Kumar
  • Release date: July 2018
  • Publisher(s): Packt Publishing
  • ISBN: 9781786465092