Integrated color (source: Pixabay)

Organizations in the private and public sectors alike are looking for ways to integrate relevant data across the enterprise in support of business, operational, and compliance needs.

Big data technologies, and NoSQL databases in particular, facilitate the ingestion, processing, and search of data without regard to schema (database structure). Web companies such as Google, LinkedIn, and Facebook use big data technologies to process tremendous amounts of data from many sources without regard to structure, and offer a searchable interface to access it. Modern NoSQL technologies have evolved to offer capabilities to govern, process, secure, and deliver data, and have facilitated the development of an integration pattern called the operational data hub (ODH).

The Centers for Medicare and Medicaid Services (CMS) and other public and private organizations in the health, finance, banking, entertainment, insurance, and defense sectors (among others) use ODH technologies for enterprise data integration. This gives them the ability to access, integrate, master, process, and deliver data across the enterprise.

Traditional model and data silos

For decades, the standard pattern for producing operational data and enterprise analytics was to develop data warehouses, each with a schema designed for one specific purpose. Let’s consider an example: an HR department requires detailed analysis of human resource data. The development team is engaged to elicit requirements for the reports to be generated and to design a database schema to store that data. Data feeds are developed to pull HR data from all relevant systems (such as payroll and vacation registers), then insert or update tables in the data warehouse to build the required analytical data. Once this work is complete, the HR director can pull metrics on trends in pay raises, tenure, and paid time off.

However, if additional information, such as trends in employee satisfaction scores, is required, the development team has to be engaged again to elicit requirements, source the data, determine the impact on the database, and then build the processes that update the data warehouse. This cycle has to be repeated every time the data warehouse needs updating. Each update typically involves a tremendous amount of development and testing to ensure the updated schema does not break existing code. For this reason, the level of effort for analyzing and implementing any change is typically enormous.
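To make the cost of a fixed schema concrete, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for a warehouse database. The `hr_metrics` table, its columns, and the `load_satisfaction` helper are hypothetical names chosen for illustration; a real warehouse change would also involve feeds, reports, and regression testing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical warehouse table designed only for the original HR reports.
conn.execute(
    "CREATE TABLE hr_metrics (employee_id INTEGER, pay_raise_pct REAL, pto_days REAL)"
)
conn.execute("INSERT INTO hr_metrics VALUES (1, 3.5, 12.0)")


def load_satisfaction(conn, employee_id, score):
    """Try to store a field the original schema never anticipated."""
    conn.execute(
        "UPDATE hr_metrics SET satisfaction_score = ? WHERE employee_id = ?",
        (score, employee_id),
    )


try:
    load_satisfaction(conn, 1, 4.2)
    schema_change_needed = False
except sqlite3.OperationalError:
    # The column does not exist, so the schema must change before any
    # new data can be loaded.
    schema_change_needed = True
    conn.execute("ALTER TABLE hr_metrics ADD COLUMN satisfaction_score REAL")
    load_satisfaction(conn, 1, 4.2)

row = conn.execute(
    "SELECT employee_id, satisfaction_score FROM hr_metrics"
).fetchone()
```

Even in this toy case, the new data cannot be stored until the table definition is altered, which is the step that triggers impact analysis and retesting in a real warehouse.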

Each department, having developed its own operational systems and its own data warehouses, could execute business processes and draw analytical information. However, this practice isolated information technology resources from one another, creating what are referred to as “data silos.” It is very difficult to draw analytical correlations across data silos. For instance, if a CEO wanted to know the impact of seasonal staff turnover on the ability to fulfill product delivery and shipment, HR, sales, production, and shipping data would have to be correlated over time. The effort involved typically meant long delays and significant cost before an answer was available.

Many organizations used relational technologies to implement enterprise data warehouses (EDWs) across the relevant data silos to answer enterprise-wide questions. However, these EDWs suffer from the same challenges as their smaller, departmental cousins. The effort associated with designing and implementing the schema, data extracts, and data feeds is significant. Once an EDW is built, changes to it are typically no easier.

How are things better with an ODH?

An ODH combines the flexible schema processing capabilities of NoSQL technologies with the governance, rigor, and transactional integrity of relational technologies. To illustrate how an ODH would be helpful, let’s return to the example above. Since an ODH is built on a NoSQL technology, and NoSQL technologies allow data to be ingested without regard to schema, the organization can start ingesting available data in raw format into the ODH. Our organization has the following systems across the enterprise:

  • A payroll system that includes employee, position, benefits, and payroll information
  • A vacation register that manages, approves, and tracks paid time off
  • A training system that tracks compliance training and job-related training
  • A product management system that manages product development and parts ordering
  • Warehouse management that tracks products on hand and manages shipping
  • An order management system that manages sales and customer information
  • A customer relationship management (CRM) system that manages customer information and tracks sales
  • A document management system that manages electronic versions of paper documents

The files that the payroll system, vacation system, and training system exchange with one another to coordinate HR information can be ingested into the ODH in raw format. The same can be done for the warehouse management, order management, and CRM systems. Data can be ingested directly from the product management database and the documents in the document management system. This allows for structured files, unstructured (document) files, and database content to reside together in their native formats in the ODH, where they can be indexed, processed, and searched. Thus far, the only additional efforts expended are:

  • the data queries (simple SQL) from the product management system
  • the processes to ingest the files from the existing interfaces
  • the processes to ingest the PDF files from the document management system

So, with very little effort expended, we have all the data across the enterprise in a single location, with metadata specifying the source of each item. Data analysts can now query across these sources to answer the kinds of questions the CEO asked, and store the results for quick retrieval later.
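As a rough illustration of schema-agnostic ingestion, the sketch below wraps each item in an envelope that records its source and keeps the content in its native shape. The `RawStore` class, its methods, and the sample records are hypothetical and held in memory; they do not represent the API of any particular ODH product.

```python
import json
from datetime import datetime, timezone


class RawStore:
    """Toy in-memory stand-in for a schema-agnostic document store."""

    def __init__(self):
        self.documents = []

    def ingest(self, source, fmt, content):
        # Store the content as-is; only the envelope metadata is uniform.
        envelope = {
            "metadata": {
                "source": source,
                "format": fmt,
                "ingested_at": datetime.now(timezone.utc).isoformat(),
            },
            "content": content,
        }
        self.documents.append(envelope)
        return envelope

    def search(self, term):
        # Naive full-text search over the serialized content of every document.
        term = term.lower()
        return [
            doc for doc in self.documents
            if term in json.dumps(doc["content"]).lower()
        ]


store = RawStore()
store.ingest("payroll", "csv-row", {"emp_id": "E100", "name": "Ana Ruiz", "position": "Picker"})
store.ingest("warehouse", "json", {"shipment": 881, "packed_by": "E100", "status": "late"})
store.ingest("document_management", "pdf-text", "Exit interview notes for employee E100")

hits = store.search("E100")
sources = sorted(doc["metadata"]["source"] for doc in hits)
```

A single search for the employee identifier returns items from all three systems, even though none of them share a schema; the envelope metadata tells the analyst where each one came from.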

However, the true value of the ODH is realized when we leverage the data governance, processing, and consistency capabilities to establish data processing patterns upon ingest. In addition to ingesting the raw data, additional processes can:

  • group cohorts of data based on identifiers or configurable fuzzy logic
  • apply an in-place harmonized (canonical) model of data elements
  • apply data quality updates
  • create master records with updates from disparate systems

Let’s see how this applies to our running example. The IT organization can set up scripts that progressively apply a common data structure (canonical model) to the stored data over time, so that it can be easily processed and searched. Scripts are developed to group cohorts of data, such as clients, vendors, or employees. Updates to these cohorts from any source system are applied to a central, mastered copy of the record. If, for instance, a client updates their address, we don’t have to rely on fragile point-to-point integrations between the CRM, order management, and warehouse management systems. The update is processed centrally and the ODH distributes the data to the transactional systems for further processing. Just as it is schema-agnostic during ingest, the ODH also allows for configurable schema mapping upon data delivery. Data can therefore be translated upon ingest, during processing, or upon delivery, which gives maximum flexibility in data distribution.
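The address-change scenario can be sketched as a central update followed by delivery through per-target mappings. The target field names in `DELIVERY_MAPS`, the `apply_update` and `deliver` helpers, and the sample record are all hypothetical; the point is only that each transactional system receives the same mastered values in its own shape.

```python
# Hypothetical mappings from canonical field names to each target system's fields.
DELIVERY_MAPS = {
    "crm": {"client_id": "CustomerId", "address": "MailingAddress"},
    "order_management": {"client_id": "cust_no", "address": "ship_to"},
    "warehouse_management": {"client_id": "consignee", "address": "delivery_addr"},
}


def apply_update(master, changes):
    """Apply a change to the central mastered record, returning a new copy."""
    return {**master, **changes}


def deliver(master, delivery_maps):
    """Build one outbound payload per target system from the mastered record."""
    return {
        target: {
            target_field: master[canonical_field]
            for canonical_field, target_field in mapping.items()
        }
        for target, mapping in delivery_maps.items()
    }


master = {"client_id": "C-42", "name": "John Smith", "address": "1 Elm St"}
master = apply_update(master, {"address": "9 Oak Ave"})
outbound = deliver(master, DELIVERY_MAPS)
```

Adding a new downstream system in this sketch means adding one mapping entry rather than building another point-to-point integration.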

Why do we need operational data hubs? We need them to facilitate enterprise data integration with the flexibility of big data/NoSQL technologies, but with the added rigor, governance, and consistency required in an enterprise environment. The ODH facilitates data exchange across the enterprise and allows for analytical processing of raw or mastered data at a fraction of the cost of traditional technologies.

This post is a collaboration between O’Reilly and MarkLogic. See our statement of editorial independence.
