Book description
An authoritative guide to designing effective solutions for data cleansing, ETL, and file management with SQL Server 2008 Integration Services
SQL Server Integration Services (SSIS) is the leading tool in the data warehouse industry, used for performing extraction, transformation, and load operations.
After an overview of SSIS architecture, the authors walk you a series of real-world problems and show various techniques for handling them.
Shows you how to design SSIS solutions for data cleansing, ETL and file management
Demonstrates how to integrate data from a variety of data sources,
Shows how to monitor SSIS performance,
Demonstrates how to avoid common pitfalls involved with SSIS deployment
Explains how to ensure performance of the deployed solution and effectively handle unexpected system failures and outages
The companion Web site provides sample code and database scripts that readers can directly implement
This book shows you how to design, build, deploy, and manage solutions to real-world problems that SSIS administrators and developers face day-to-day.
Table of contents
- Cover
- Title
- Copyright
- Dedication
- About the Authors
- Credits
- Acknowledgments
- Introduction
-
Chapter 1: SSIS Solution Architecture
- Problem
- Macro Challenge: Enterprise Data Spaghetti
- Micro Challenge: Data-Processing Confusion
- Problems with Execution and Troubleshooting
- Infrastructure Challenges
- Other Challenges
- Design
- Choosing the Right Tool
- Overall Solution Architecture
- Data Integration or Consolidation
- Data Warehouse ETL
- Project Planning and Team Development
- Agile Benefits
- Agile Cautions and Planning
- Data Element Documentation
- Package Design Patterns
- Modular Packages
- Master Packages
- Server and Storage Hardware
- Server Hardware
- Development and Test Servers
- ETL Collocation
- Storage Hardware
- Package Execution Location
- Package Storage Location Versus Execution Location
- Execute SQL Task and Bulk Insert Task Execution
- Package Execution and the Data Flow
- Design Review
- Solution
- Setting the Stage: Management and Deployment
- Source Data: Files, Tables, and Data Cleansing
- Data Warehouse ETL and Cube Processing
- Advanced ETL: Scripting, High Availability, and Performance
- Summary
-
Chapter 2: SSIS Management Framework Design
- Problem
- Challenges of Not Having a Framework Implementation
- Different Development Methods
- Changing Metadata
- Getting the Right Information to the Right People at the Right Time
- Reusability
- Framework Implementation
- Configuration Management Scheme
- Logging and Auditing
- Template Package
- Framework Benefits and Value
- Design
- Configuration Management
- Overall Design
- Environment Variable Configuration
- XML Configuration File
- SQL Server Configuration
- Logging and Auditing Mechanism
- Storage
- Tables
- Stored Procedures
- Template Package
- Implementation Guidelines
- Solution
- Configuration Management
- Logging and Auditing Mechanism
- Storage and Tables
- Stored Procedures
- Template Package
- Development
- Installation
- Other Considerations
- Customizations
- ETL Process Framework
- Process Owner
- Reporting
- Summary
-
Chapter 3: Package Deployment and Storage Decisions
- Problem
- Standardization
- Environment
- Application
- Desired Implementation
- Design
- Storage Methods
- SSIS Service
- File System Storage
- SQL Server Storage
- Storage Comparison
- Deployment Methods
- SSIS Deployment Wizard
- Manual Deployment of Packages
- DTUtil Scripting
- Solution
- Storage Methodology
- Existing Infrastructure
- Security
- Package Makeup
- Back to the Flowchart
- Deployment Methodology
- Storage Method
- Existing Infrastructure
- Deployment Needs
- Back to the Flowchart
- Total Methodology
- Summary
-
Chapter 4: File-Handling and Processing Methods
- Problem
- Simple File Operations
- Remote File Operations
- File Criteria
- File Iteration
- File Notification
- Design
- File System Task
- Source and Destination Connection
- Attributes
- Overwriting the Destination
- Multiple Files
- FTP Task
- FTP Connection
- HTTP
- Foreach Loop
- Applying Advanced Criteria
- File Watcher Task
- WMI Event Watcher Task
- Solution
- Summary
-
Chapter 5: Data Extraction Best Practices
- Problem
- Extraction Data Criteria
- Source System Impact
- Incremental Extraction
- Deleted Rows
- Staging Database
- Data Lineage and Metadata
- File Sources
- Design
- Package Connections and Source Adapters
- Package Connections
- Source Adapters
- Incremental Data Extraction
- Incremental Extraction Using a Change Identifier Value
- Targeting Change Records through Dynamic Source Queries
- Retrieving Incremental Identifier Values and Updating Package Variables
- Capturing the Maximum Change Identifier Value
- Incremental Extraction from SQL Server without a Trigger
- Using SSIS to Handle All Aspects of an Incremental Extraction
- Data Staging Method
- Tracking Data Lineage Identifiers
- Solution
- Metadata-Driven Extraction Example
- Metadata Tables
- Control Flow
- Looping through the Sources
- Inside the Sources Loop
- The Script
- Read Variables
- Open Connections
- Get Max Change Indicator
- Extract Changed Source Rows
- Close Connections
- Summary
-
Chapter 6: Data-Cleansing Design
- Problem
- Candidate Key Profiles
- Column Length Distribution Profiles
- Column Null Ratio Profiles
- Column Pattern Profiles
- Column Statistics Profiles
- Column Value Distribution Profiles
- Functional Dependency Profiles
- Design
- Using the Script Transform for Data Scrubbing
- Using the Fuzzy Grouping Transform to De-duplicate Data
- Using the Fuzzy Lookup Transform to Cleanse Data
- Dealing with Multiple Record Types in a Single File
- Using the Raw File
- Solution
- Summary
-
Chapter 7: Dimension Table ETL
- Problem — Fundamental Dimension ETL
- Dimensions: The Basics
- Dimension ETL: The Challenge
- Design — Fundamental Dimension ETL
- Data Preparation
- Dimension Change Types
- Type 1 (Changing) Attribute: A Closer Look
- Type 2 (Historical) Attribute: A Closer Look
- Inferred Members
- Solution — Fundamental Dimension ETL
- Preparing Your Source Data for Dimension ETL
- SSIS Slowly Changing Dimension Wizard
- Advanced Properties and Additional Outputs of the SCD
- Problem — Advanced Dimension ETL
- SCD Wizard Advantages and Disadvantages
- Dimension Volume and Complexity
- Design — Advanced Dimension ETL
- Optimizing the Built-in SCD
- Index Optimizations
- Update Optimizations
- Snowflake Dimension Tables
- Parent-Child Dimension ETL
- Date Dimension ETL
- Profile Dimension and Junk Dimension ETL
- Creating a Custom Slowly Changing Package
- Solution — Advanced Dimension ETL
- Snowflake Dimensions
- Parent-Child Dimensions
- Profile and Junk Dimensions
- Date Dimension ETL
- Custom Dimension ETL
- Determining Dimension Changes
- Inserts and Updates
- Summary
-
Chapter 8: Fact Table ETL
- Problem
- Fact Tables: The Basics
- Fact Table ETL: The Challenge
- Preparing the Data
- Mapping Dimension Keys
- Calculating Measures
- Adding Metadata
- Fact Table Updates
- Fact Table Inserts
- Design
- Data Preparation
- Data Preparation with SSIS Transformations
- Data Preparation Examples
- Acquiring the Dimension Surrogate Key in SSIS
- Identifying the Dimension Surrogate Keys with SSIS
- Surrogate Key Examples in SSIS
- Measure Calculations
- Measure Calculation Types
- Handling Measure Calculations in SSIS
- Managing Fact Table Changes
- Approaches to Identifying Changed Fact Records
- Fact Update Examples in SSIS
- Optimizing Fact Table Inserts
- Optimizing Inserts with Fast Load
- Optimizing Inserts with Index Management
- Solution
- Internet and Reseller Sales Fact Table ETL
- Fact Internet and Reseller Sales Extraction and Transform Process
- Fact Internet and Reseller Sales Load Process
- Snapshot Fact Table Example — Call Center Fact Table
- Advanced Fact Table ETL Concepts
- Handling Missing Dimension Lookups
- Handling Late-Arriving Facts
- Summary
-
Chapter 9: SSAS Processing Architecture
- Problem
- SSAS Object Population
- Schedule
- Partition Management
- The Complete Package
- Design
- SSAS Objects and Processing Basics
- Dimensions
- Partitions
- Mining Structures
- SSIS Tasks and Components
- Analysis Services Processing Task
- Analysis Services Execute DDL Task
- Execute Process Task with ASCMD
- Data Flow Destinations for SSAS Objects
- Script Task with AMO
- Creating and Managing Partitions
- Overall Design
- Solution
- Preparation for SSAS Integration
- Process Dimensions Package
- Process Task
- Parallel XMLA
- Process Partitions Package
- Storing and Loading Metadata
- SSAS Processing
- Overall Solution
- Summary
-
Chapter 10: Implementing Scale-Out ETL Process
- Problem
- Design
- Design Components Overview
- Central Common Services Server
- File Processor and Pre-Aggregation Scale-Out Processes Servers
- Design Details
- File Management Tasks
- Data File Management
- Work Allocation Process
- Scale-Out Source File Process
- Work Reassignment Process
- Data Aggregation Tasks
- Hourly Data Pre-Aggregation Process
- Hourly Data Aggregation Process
- Daily Data Aggregation Process
- Archival and Clean-up Processes
- Data File Archival Process
- Stage Table Clean-up Process
- Design Conclusions
- Solution
- Central Server Services
- Multiple File Processor and Pre-Aggregator Processes
- Database Tables Required on the Central Server
- Stored Procedures
- Procedures on the Central Server
- Procedures on the Staging Servers
- SSIS Packages
- File-Processing Server Packages
- Central Server Packages
- Summary
-
Chapter 11: Scripting Design Patterns
- Problem — Advanced File Management
- Script Task
- Scenario
- Design — Advanced File Management
- Script Language
- Accessing Objects
- Custom Assemblies
- Scripting
- Solution — Advanced File Management
- Create an External Assembly
- Access the External Assembly
- Archive the Files
- Summary — Advanced File Management
- Problem — Call Center Fact ETL
- Reasons to Use Scripting
- Scenario
- Design — Call Center Fact ETL
- Component Type
- Output Type
- Synchronous Outputs
- Asynchronous Outputs
- Other Component Types
- Design Choice
- Overridable Methods
- Solution — Call Center Fact ETL
- Package Setup
- Script Component
- Reusable Methods
- Row-By-Row Processing
- Process Input
- Entire Script
- Package Wrap-Up
- Summary — Call Center Fact ETL
- Summary
-
Chapter 12: SSIS Package Scaling
- Problem
- Identifying Task Durations
- Identifying Data Flow Destination Issues
- Identifying Transformation and Memory Bottlenecks
- Identifying Data Flow Source Performance Issues
- Design
- Balancing SQL Operations with the Data Flow
- Data Flow Advantages as Compared with SQL
- SQL Advantages when Compared with the Data Flow
- Applying SQL Code in Your SSIS Packages
- SSIS Pipeline Architecture Concepts
- Data Flow Design Example
- SQL-Centric SSIS Process
- Rewritten Data Flow–Centric SSIS Process
- Solution
- Tuning Your Data Flow
- Use Blocking Transformations Sparingly
- Limit Row-by-Row Operations
- Manage Data Flow Backpressure
- Pre-sort Sources as an Alternative to the Sort Transformation
- Optimize the Lookup and Managing the Cache Size
- Remove Unused Columns from the Data Flow
- Be Mindful of the SSIS Logging Impact
- Regulate Transactions
- Setting Data Flow Properties
- Up the EngineThreads Value
- Optimize the Temporary Storage Locations
- Leave RunInOptimizedMode as True
- Tune Buffers
- Database and Destination Optimization
- Limiting Database Growth
- Consider Dropping and Re-creating Indexes on Destination Tables
- Using the OLE DB Destination Adapter
- Use Advanced Oracle and Teradata Destination Adapters
- Handling Partitioned Tables
- Summary
- Index
- Advertisement
Product information
- Title: Microsoft® SQL Server® 2008 Integration Services: Problem-Design-Solution
- Author(s):
- Release date: November 2009
- Publisher(s): Wrox
- ISBN: 9780470525760
You might also like
book
API Testing and Development with Postman
Explore the world of APIs and learn how to integrate them with production-ready applications using Postman …
book
Microsoft SQL Server 2019: A Beginner's Guide, Seventh Edition, 7th Edition
Publisher's Note: Products purchased from Third Party sellers are not guaranteed by the publisher for quality, …
book
A Guide to the Project Management Body of Knowledge (PMBOK® Guide) – Seventh Edition and The Standard for Project Management (ENGLISH)
PMBOK® Guide is the go-to resource for project management practitioners. The project management profession has significantly …
book
Python Crash Course, 3rd Edition
Python Crash Course is the world's best-selling guide to the Python guide programming language, with over …