Chapter 4. Build a Resilient Data Architecture
Regardless of which data collaboration platform you deploy, your implementation must deliver several core features if you want a truly seamless data collaboration experience. This matters because, although all of the data collaboration platforms we discussed in the previous chapter have the potential to enable data collaboration without integrations, simply deploying those platforms is not enough to implement the concept.
Instead, you should strive in your deployments to adhere to the following principles and practices, which are what make a data collaboration architecture different from other types of data integration or pooling architectures.
Start with New Integration Projects
Before we discuss specific data collaboration features and characteristics, here is a word of advice: most businesses will find it much easier to apply this concept to projects where new integrations are needed (such as migrating systems, connecting two applications, or updating, enhancing, or building applications) than to rip and replace existing point-to-point integrations between applications or systems.
We suggest that you begin by deploying new projects according to the data collaboration playbook rather than trying to replace the integrations you’ve already built for existing applications.
With that caveat out of the way, let’s move on to a discussion of what it looks like to get started with data collaboration to achieve the seamless integration between business units illustrated in Figure 4-1.
Liberate Your Data
First and foremost, data collaboration is about liberating your data so it’s no longer application-centric. To do this, you need to ensure that every data object produced by every application in your information technology estate is accessible by default to every other application.
You could attempt to achieve this setup using data virtualization or by sharing multiple copies of the same data. But that would leave you subject to some of the key problems we discussed in Chapter 1. You’d struggle to maintain data quality and enforce data governance rules consistently across all data copies.
A better way to liberate your data is to bidirectionally sync a single copy of data across all applications that want to access it. That way, there is only one source of data to manage, and you don’t have to set up complex pipelines or fabrics to move the data from one app to another. Instead, you simply make the data available to any applications that want to work with it. Whenever the data is modified, the changes become visible to all apps because they’re all working with the same data.
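To make the single-copy idea concrete, here is a minimal Python sketch. The names SharedDataStore and App, and the record key customer:42, are hypothetical and exist only for illustration; they do not correspond to any particular data collaboration platform. The sketch simply shows two applications that keep no private copy of a record and therefore see each other's changes immediately.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SharedDataStore:
    """Hypothetical single-copy store (illustrative only, not a real platform API)."""
    _records: Dict[str, Dict[str, Any]] = field(default_factory=dict)

    def write(self, key: str, changes: Dict[str, Any]) -> None:
        # Update the one authoritative copy in place.
        self._records.setdefault(key, {}).update(changes)

    def read(self, key: str) -> Dict[str, Any]:
        # Return a snapshot so callers must go through write() to change data.
        return dict(self._records[key])


class App:
    """Hypothetical application that holds a reference to the store, not a copy of the data."""

    def __init__(self, name: str, store: SharedDataStore) -> None:
        self.name = name
        self.store = store

    def update(self, key: str, **changes: Any) -> None:
        self.store.write(key, changes)

    def view(self, key: str) -> Dict[str, Any]:
        return self.store.read(key)


store = SharedDataStore()
crm = App("crm", store)
billing = App("billing", store)

# Each app modifies the same record; neither one syncs or copies anything.
crm.update("customer:42", email="ada@example.com")
billing.update("customer:42", plan="pro")

print(crm.view("customer:42"))
print(billing.view("customer:42"))
```

Because both applications read from the same record, there is no pipeline to build and no second copy whose quality or governance could drift.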
Productize Data and Manage It as a Network
Liberating data makes it possible to productize data. Productization of data means making it accessible to business entities on a continual basis.
In turn, productization enables what we like to call “data as a network.” We use this term because one of the chief goals of data collaboration without integrations is to make it as fast and easy for applications to share data with each other as it is for them to connect with each other over a network.
To achieve this goal, you must decouple data from applications. Rather than letting data live inside your apps, store it by default in a centralized location that all apps can access.
Data stored in this way is autonomous. It no longer depends on application-level controls to enforce security or privacy protections, and there is no longer a need for application-specific integrations if you want to move data between applications.
Instead, the data is organized via a federated network. Granular access controls define which applications and people can collaborate around specific data objects or data sets.
With data as a network, applications share data with each other by default. More than that, with data collaboration free from integrations, your data network becomes the actual operational layer that drives data sharing.
Give Data Autonomous Superpowers
Decoupling data from applications makes it autonomous because data is no longer constrained by individual applications when you want to share or use it. But data autonomy goes further than that.
Data autonomy also means ensuring that your data is self-protecting, self-versioning, self-describing, and self-tracking. When you implement these features for each and every data object inside your data architecture, you get data that is capable of managing itself. Once again, you liberate the data from application-level controls and constraints.
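As a rough illustration of these four properties, the following Python sketch attaches them to a single data object. The AutonomousObject class, its field names, and the storefront and inventory actors are all assumptions made for this example, not part of any product. A real data collaboration platform would implement these capabilities in its own way.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set, Tuple


@dataclass
class AutonomousObject:
    """Hypothetical data object that carries its own controls (illustrative only)."""
    schema: Dict[str, str]   # self-describing: field name -> type name
    readers: Set[str]        # self-protecting: actors allowed to read
    writers: Set[str]        # self-protecting: actors allowed to read and write
    versions: List[Dict[str, Any]] = field(default_factory=list)  # self-versioning
    trail: List[Tuple[str, str]] = field(default_factory=list)    # self-tracking

    def read(self, actor: str) -> Dict[str, Any]:
        if actor not in self.readers | self.writers:
            self.trail.append((actor, "read-denied"))
            raise PermissionError(f"{actor} may not read this object")
        self.trail.append((actor, "read"))
        return dict(self.versions[-1]) if self.versions else {}

    def write(self, actor: str, values: Dict[str, Any]) -> int:
        if actor not in self.writers:
            self.trail.append((actor, "write-denied"))
            raise PermissionError(f"{actor} may not write this object")
        unknown = set(values) - set(self.schema)
        if unknown:
            raise ValueError(f"fields not in schema: {sorted(unknown)}")
        # Never overwrite: each write appends a new version.
        latest = dict(self.versions[-1]) if self.versions else {}
        latest.update(values)
        self.versions.append(latest)
        self.trail.append((actor, "write"))
        return len(self.versions)  # version number just created


obj = AutonomousObject(
    schema={"sku": "str", "stock": "int"},
    readers={"storefront"},
    writers={"inventory"},
)
obj.write("inventory", {"sku": "A-1", "stock": 5})
obj.write("inventory", {"stock": 4})
```

Note that the access rules, the history, and the audit trail travel with the object itself, so no individual application has to supply them.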
Simplify Data Management
By the same token, liberated data should be capable of enforcing data quality controls, cleaning itself to remove issues like redundancies, and supporting automated backups.
When your data does these things, it becomes far easier to manage. Rather than combing through hundreds of applications to find and address data quality problems, you can manage them centrally through your data collaboration architecture. Instead of having to execute complex backup operations to protect data spread across a sprawling set of apps, you can do it all centrally and efficiently.
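The sketch below shows what a single, central quality pass might look like in Python. It assumes, purely for illustration, that records are dictionaries keyed by an email field and that a valid email is any value containing an @ sign. The clean and normalize_email functions are hypothetical helpers, and a production quality rule would be far stricter.

```python
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, str]


def normalize_email(value: str) -> str:
    # Trim whitespace and lowercase so trivially different spellings match.
    return value.strip().lower()


def clean(records: Iterable[Record]) -> Tuple[List[Record], List[Record]]:
    """Return (accepted, rejected) after one central quality pass."""
    accepted: Dict[str, Record] = {}
    rejected: List[Record] = []
    for record in records:
        email = normalize_email(record.get("email", ""))
        if "@" not in email:
            # Quality control: records without a plausible key are set aside.
            rejected.append(record)
            continue
        # Redundancy removal: merge duplicates, letting later fields win.
        accepted[email] = {**accepted.get(email, {}), **record, "email": email}
    return list(accepted.values()), rejected


raw = [
    {"email": "Ada@Example.com ", "name": "Ada"},
    {"email": "ada@example.com", "plan": "pro"},
    {"email": "not-an-email", "name": "??"},
]
accepted, rejected = clean(raw)
print(accepted)
print(rejected)
```

Because the rule runs once against the shared data rather than inside each application, a fix to the rule takes effect everywhere at the same time.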
Maintain Visibility and Governance
The final key piece of the data collaboration pie is visibility and governance. To make the most of the concept, you must be able to enforce federated data controls, implement metadata layer access policies, and ensure that data can be easily tracked and audited. You should also track data engagements through transparent collaboration logs that record which applications or users accessed each data object and how they used it.
These features are especially important when your data is stored on a centralized platform where many people and applications can potentially access it. To prevent data misuse, it’s critical to enforce granular controls that restrict each user’s access to the specific data they need to manage. In addition, access controls should reflect varying user needs. Some users may need read and write permissions, while others require read-only permissions.
By its nature, data collaboration makes these processes simpler than they would be if your data were spread across all of your applications or if you depended on integrations that created multiple copies of your data. It’s much easier to avoid gaps and oversights and to ensure that each data object is protected by the proper controls and access policies when there are no duplicate copies of the data. Still, enforcing proper access controls for data requires a highly granular and sophisticated set of data protections that reflect varying user needs and levels of data sensitivity.
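To illustrate how granular permissions and a collaboration log can work together, here is a small Python sketch. The Governance class, the Permission flags, and the actor and object names are hypothetical and chosen only for this example. The sketch grants read-only or read-and-write access per actor and per data object, and it records every access check, whether allowed or denied.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Flag
from typing import Dict, List, Tuple


class Permission(Flag):
    NONE = 0
    READ = 1
    WRITE = 2


@dataclass
class LogEntry:
    when: datetime
    actor: str
    object_id: str
    action: str
    allowed: bool


@dataclass
class Governance:
    """Hypothetical central policy table plus collaboration log (illustrative only)."""
    grants: Dict[Tuple[str, str], Permission] = field(default_factory=dict)
    log: List[LogEntry] = field(default_factory=list)

    def grant(self, actor: str, object_id: str, permission: Permission) -> None:
        self.grants[(actor, object_id)] = permission

    def check(self, actor: str, object_id: str, needed: Permission) -> bool:
        # `needed` is expected to be a single permission: READ or WRITE.
        held = self.grants.get((actor, object_id), Permission.NONE)
        allowed = (held & needed) == needed
        # Record every attempt so that denied requests are auditable too.
        self.log.append(
            LogEntry(datetime.now(timezone.utc), actor, object_id, str(needed.name), allowed)
        )
        return allowed


gov = Governance()
gov.grant("analyst", "orders", Permission.READ)                            # read-only
gov.grant("order-service", "orders", Permission.READ | Permission.WRITE)  # read and write

print(gov.check("analyst", "orders", Permission.READ))
print(gov.check("analyst", "orders", Permission.WRITE))
print(gov.check("order-service", "orders", Permission.WRITE))
```

Anyone without an explicit grant is denied by default, which is one simple way to keep a centralized platform from becoming open to everyone.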
The features described in this chapter are the fundamental principles of data collaboration, but they’re certainly not all there is to say about the concept. In the next chapter, we’ll look at going above and beyond the basics of data collaboration without integrations by applying the technique to multiple applications, extending the concepts to legacy systems, and more.