Chapter 4. Managing Knowledge

Knowledge management is the process of coordinating and managing the information required to maintain and operate your applications, including configuration, secrets, and documentation. Effective knowledge management involves making sure this information is in as uniform and consistent a format as possible, available as needed to everyone responsible. Knowledge management is also about reducing the amount of knowledge individuals need to operate and maintain a system. It helps ensure that everyone at the organization—from new employees to seasoned practitioners—is as productive as possible.

Understanding and maintaining complex systems tends to require a significant diversity of knowledge and expertise. The more complex a system is, the more knowledge is required to maintain it. Furthermore, the cost and difficulty of building and maintaining a system are directly proportional to the knowledge required to build and maintain it. Figure 4-1 shows this relationship.

Generally speaking, the more knowledge a system requires, the harder it is to maintain that knowledge and the more complex the system is.

Understanding a system whose operation and maintenance require diverse expertise also involves a high cognitive load. Cognitive load is a measure of the amount of information someone must hold in their working memory at one time. The more varied and unique the knowledge requirements are, the more context you must keep in mind in order to understand them, and hence the greater that load. Knowledge management helps reduce cognitive load, making building and operating the system more efficient.

Figure 4-1. The interrelation between quantity and diversity of knowledge required and system complexity

Knowledge Variability: Choice Versus Complexity

A trade-off exists. The more choices you give your teams in how they can build, develop, and maintain an application, the more innovative they can be. This typically results in faster time to market, more competitive products, and ultimately more satisfied staff.

But this increased diversity introduces complexity, which means a greater amount of knowledge is required to understand how the system operates.

As a result, the very characteristics that bring you market value and success in the short term—specifically, innovation and flexibility—result in increased cognitive load, technical debt, and complexity in the long term. Figure 4-2 illustrates this.

The more diversity and innovation allowed in the early application development process, the greater the likelihood of market success and the more value you bring to the market. But those same qualities also increase long-term cognitive load and application complexity, adversely impacting performance and profitability.

Figure 4-2. Innovation brings market value, but also complexity

Have you ever decided to implement a quick and dirty feature because you needed to get it to market quickly, despite the additional complexity and technical debt it brings to your application in the long term? This is innovation at work, introducing short-term value at the cost of long-term complexity.

Effectively managing knowledge is fundamental to reducing complexity in a system, which is required to reduce cognitive load and ultimately improve maintainability. Long-term knowledge management is often at odds with innovation and choice, and the balance must be managed to create both short-term and long-term value.

The intent here isn’t to say that innovation is bad. On the contrary, innovation and diversity in thought and process are critical to the success of an application (and a business). However, it’s important to be aware of the sometimes-hidden long-term costs and risks associated with this mindset.

That is, when innovating and making short-term decisions, you must factor into your deliberations the long-term cognitive effects of those choices. Innovation at all costs is a recipe for long-term disaster, but managed innovation can help create a long-lasting product offering.

Managing Knowledge Requirements

The goal of knowledge management is to gather all the available information about the tools, systems, processes, procedures, and requirements that are part of the system and that keep it operating efficiently, without overcorrecting into restrictions so tight that they stifle innovation.

How do you control the knowledge that a system requires? Interestingly, by using the same tools, techniques, and processes you use to manage overall technical complexity in a system:

Understanding (measuring)

You can’t manage the knowledge requirements of a system until you understand the breadth and scope of knowledge required to operate it.

Loose coupling

Keeping systems independent from each other—reducing the dependencies between them—decreases knowledge requirements. Dependency management, side effects, and unexpected outcomes are all problems associated with highly dependent systems. Independent systems avoid these complexities and require less knowledge to operate than highly interdependent systems do.

Standardization

Using standardized methods, procedures, and processes provides a structure to create reusable components that can be leveraged in common ways. Standardization is key to keeping systems simple and cognitive load low.

Reuse

In standardized systems, the reuse of knowledge, configurations, and information (and components and infrastructure) is easier to accomplish. Leveraging commonality in systems and keeping systems consistent and regular reduces the cognitive load of understanding how an application functions. The more you make your services and systems operate similarly to other services and systems, the less variability there is—that is, the less additional knowledge is needed to use them. This lowers the cognitive load involved in dealing with complexity.

When managing complex systems, reducing the amount of knowledge required to operate them and the variability in that knowledge has various advantages. These include:

Greater productivity

Simpler systems are easier to understand, so new employees can become productive with them quickly. This reduction in time to value boosts productivity.

Increased supportability and uniformity

Simpler systems relying on standardization and reuse have more supportable components and more uniform adoption of best practices. Reuse improves resilience and reliability, which improves supportability. Uniformity reduces complexity.

Centralized Configuration

A common issue with production applications is where to store and manage the configuration, setup, and other data used to operate the application. This includes things like database access credentials, third-party service activation tokens, network router and switch configuration files, firewall configurations, cache setup parameters, database configuration information, and server configuration files.

This information is often stored at the point where it’s needed. For example, as shown in Figure 4-3, network devices (firewalls, switches, routers) store their configuration information within the devices themselves, the web server configuration is stored with the web server, and the database configuration is stored with the database. The credentials required to use these capabilities are stored in the application that is using the various devices and services.

Figure 4-3. Configuration throughout an application and its infrastructure

Storing the information needed to set up and manage a service directly with the consumer of that information may seem like a good model. There are, however, several problems with this approach:

Security/vulnerability

When you store credentials alongside the application that uses them, a compromise of that application also compromises every service whose credentials it holds. This is bad practice from a security standpoint.

Consistency/reusability

When you store the individual configuration files in the devices themselves, there is no central knowledge of how the systems are set up and configured. This means you can’t compare the configurations of, for example, one network router to another to see how they differ, or update one to match the other. In turn, this means you can’t easily apply the best practices used in one configuration to another.

Availability/safety

Storing configuration files in the devices themselves also means there is no redundancy or backup of that information. If a device fails (or becomes compromised) and needs to be replaced, its configuration is lost with it. If a network switch fails, for example, how should the new switch be configured? Unless a backup is maintained outside the device, replacing it involves the additional effort of reconstructing the requirements for the configuration from scratch.

Traceability

If someone changes the configuration of a needed resource and that change ends up causing an application outage, then without centralized knowledge it can be difficult to discover what changed recently, which in turn makes it difficult to find and repair the damage and mitigate the downtime. Traceability has long been recognized as valuable in software code, but it’s equally important in infrastructure configuration and setup. Traceability isn’t about blaming the person who caused the problem; it’s about identifying what happened so the issue can be resolved more quickly, and (in the longer term) about creating processes and systems to make sure the same problem doesn’t occur again.

Security, availability, consistency, and traceability are all impeded when the configuration information is stored with the resource itself; as we saw in Chapter 1, this can have serious consequences.

The harder it is to create, update, manage, verify, or reuse an infrastructure configuration, the greater the technical debt in your application, and the greater the overall operational complexity. To combat this problem, a system for centrally managing and controlling the configurations is required.

Maintaining a Centralized Single Source of Truth

The best approach to take to control and reduce the complexity associated with configuration is to maintain the configuration information centrally, outside the application and infrastructure resources themselves. This model, which relies on a single source of truth, protects your application’s configuration and simplifies your problem diagnosis and resolution processes. Centralization of the information allows reuse and repeatability; as we have seen, this reduces complexity and improves consistency, which enhances reliability.

Figure 4-4 illustrates this approach. In centralized configuration management systems, configurations for all components are stored in a single, common location. Then, when a change is made to any of those configurations, the new version is pushed to each and every corresponding device, updating its internal copy. The authoritative versions of all configurations related to the system are stored together, off-device, and can be manipulated as a set. This centralized storage model makes it easy to track the configurations and any changes made to them. Additionally, all changes are made in a well-known location that is easier to access than the remote devices themselves.

Figure 4-4. Centralized configuration management
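
To make the push model concrete, here is a minimal sketch in Python. It assumes a central store laid out as one configuration file per device; push_to_device is a hypothetical stand-in for whatever transport (SSH, NETCONF, a vendor API) your devices actually support.

    import hashlib
    from pathlib import Path

    # Assumed layout: the central store holds the authoritative copy of every
    # device's configuration, one file per device (e.g., router-123.conf).
    CENTRAL_STORE = Path("/srv/config-store")
    deployed_hashes: dict[str, str] = {}  # hash of the version last pushed to each device

    def push_to_device(device: str, config_text: str) -> None:
        # Hypothetical transport; in practice this would be SSH, NETCONF,
        # or a vendor API call.
        print(f"pushing {len(config_text)} bytes to {device}")

    def sync_all() -> None:
        # Push any configuration whose central copy differs from what was
        # last deployed, updating every corresponding device's internal copy.
        for config_file in sorted(CENTRAL_STORE.glob("*.conf")):
            device = config_file.stem
            text = config_file.read_text()
            digest = hashlib.sha256(text.encode()).hexdigest()
            if deployed_hashes.get(device) != digest:
                push_to_device(device, text)
                deployed_hashes[device] = digest

    sync_all()

The essential property is that the device never holds the authoritative version; it only receives copies from the well-known central location.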

Centralized source of truth versus single source of truth

Maintaining a single source of truth does not necessarily mean maintaining a centralized source of truth. The two concepts are different and come with different advantages and disadvantages:

  • A centralized source of truth is where all configuration and management information is maintained in a single, centralized location.

  • A single source of truth is where a specific piece or type of information is maintained in one location only. It can be replicated to many alternate locations (backups, different locales, etc.) and pushed to the location where it is consumed (the resource that requires the configuration), but a single copy is maintained and managed in a single location, and that is the authoritative version.

To better understand this distinction, take a look at Figure 4-5. With a single source of truth, information for a given resource is managed in a single location using a single management tool. Anyone needing to make changes to the configuration must make them in this one location. After the changes are made and have gone through any needed approval process, they are pushed to the resource(s) that require the configuration.

Figure 4-5. Multiple single sources of truth are decentralized

A single source of truth has the advantage of consistency and reliability. All people making changes use a consistent process for making those changes, and all changes are tracked in a single location. However, different resources can use different locations to manage their “single source.” As Figure 4-5 shows, different teams (network teams, application teams, database teams) can manage their configurations independently and in different manners, while still benefiting from the advantages of those configurations coming from a single external source. That is, individual teams can store configuration files for their applications in different locations than other teams. Each configuration has a “single source,” yet there is no centralized management of the configurations.

A centralized source of truth is a step beyond a single source of truth. In a centralized truth model, all configuration for the entire system is maintained in a single location using a single configuration tool. This is illustrated in Figure 4-6.

Figure 4-6. Centralized configuration across teams and resource types

Here, the configuration for all resources is managed in one location, and all teams use the same tool to make all configuration changes for all parts of the system.

In addition to all the advantages of a single source of truth, a centralized source of truth also makes it easy to compare configurations to see how they are different: “Are the security settings of Router 123 different from those of Router 382? Why?”

Centralized truth facilitates the use of standardized best practices, reuse patterns, and layered configurations in a more consistent manner, allowing a more dramatic reduction in application complexity. While accessing a single configuration source from all parts of the application may not always be the most convenient solution, using this model more strongly encourages consistency and reuse than a simple single-source-of-truth model.

Revision management

Once the configuration is centralized, you can use revision control to manage it. Git, the tool used by software engineers worldwide to manage software source code, can just as easily be used to maintain configuration and system files. Doing so gives you all the benefits of revision management that apply to source code (a minimal sketch follows this list):

  • Configurations can be backed up and maintained consistently in multiple redundant locations.

  • Updated configurations can be compared to previous configurations to make sure that only the desired changes have been made and no unnecessary, undesired, or incorrect changes have accidentally been included.

  • Configuration changes can be peer reviewed before they are deployed.

  • Full revision control and approval workflows can be implemented to ensure that only correct and desired changes occur.

  • If a decrease in performance or reliability is observed, revision history allows you to go back and see what changed at the point in time when the problem occurred, as an aid in diagnosing how to resolve the issue.

  • With proper processes in place, it becomes difficult for bad actors to make configuration changes, and when unauthorized changes do occur, they are easy to identify and resolve.
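
As a rough illustration of what this looks like day to day, the following Python sketch wraps the handful of Git commands involved in recording and reviewing a configuration change. The repository path, file name, and commit message are assumptions for illustration only.

    import subprocess
    from pathlib import Path

    REPO = Path("/srv/config-store")  # assumed Git repository holding the configurations

    def run_git(*args: str) -> str:
        # Run a git command inside the configuration repository and return its output.
        result = subprocess.run(["git", *args], cwd=REPO, check=True,
                                capture_output=True, text=True)
        return result.stdout

    # Record a configuration change along with an explanatory message.
    (REPO / "router-123.conf").write_text("hostname router-123\nmtu 9000\n")
    run_git("add", "router-123.conf")
    run_git("commit", "-m", "router-123: raise MTU for jumbo frames")

    # Review recent history and the exact change, the diagnostic step described above.
    # (The diff assumes the repository already has at least one earlier commit.)
    print(run_git("log", "--oneline", "-5"))
    print(run_git("diff", "HEAD~1", "HEAD", "--", "router-123.conf"))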

Are You in Control?

You might be thinking that this all sounds obvious. Of course you don’t store configuration information exclusively in the devices themselves. You keep backups and ensure redundancy, and you use software provided by the network resource vendors to manage your network resources. This is best practice, right?

Well, you still might be more exposed than you think. For example, do you maintain off-site copies and single-source-of-truth versions of the following configuration files?

  • Apache configuration files

  • /etc/hosts files on your Linux servers

  • /etc/sysconfig/network-scripts/* network scripts on Linux servers

  • AWS IAM policies for the thousands of AWS components you use

  • Tuning variables for your cloud-hosted MySQL database

You might be surprised by the breadth, scope, and interdependence of configuration information your complex application requires. Understanding where all your configuration requirements exist and how best to centralize them will require a complete audit and assessment of your application—the type of audit and assessment we discussed in Chapter 2. Once you have done this assessment, you’ll understand where your configuration-based knowledge is stored, and you can create a strategy to manage it in a centralized single source of truth.
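
As a starting point for that assessment, even a simple script that snapshots known configuration files into one central location can reveal how scattered your configuration really is. The sketch below is illustrative only; the tracked paths are examples from the list above, and the snapshot location is an assumption.

    import shutil
    from datetime import datetime, timezone
    from pathlib import Path

    # Example paths drawn from the checklist above; a real audit would
    # enumerate far more, across every server and service.
    TRACKED_FILES = [
        Path("/etc/hosts"),
        Path("/etc/httpd/conf/httpd.conf"),
        Path("/etc/sysconfig/network-scripts/ifcfg-eth0"),
    ]
    SNAPSHOT_ROOT = Path("/srv/config-audit")  # assumed central collection point

    def snapshot() -> Path:
        # Copy each tracked file into a timestamped, centrally stored snapshot,
        # preserving the original directory structure.
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        dest = SNAPSHOT_ROOT / stamp
        for src in TRACKED_FILES:
            if src.exists():
                target = dest / src.relative_to("/")
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(src, target)
        return dest

    print(f"snapshot written to {snapshot()}")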

Reuse

Centralized knowledge management facilitates reuse. How so? If all the configuration files for your applications, devices, services, and components are in one location, you can easily leverage information from one configuration in another, perhaps for a similar device. Reuse reduces reinvention, making it less likely that multiple distinct solutions to the same problem are employed. This limits variability in design and hence decreases the amount of information needed to understand how the system as a whole operates, reducing complexity and cognitive load.

There are multiple patterns for implementing reuse, each with a distinct set of advantages and disadvantages. I will focus on two of them here.

Pattern 1: Copy/Paste Reuse

With the copy/paste reuse model, a system, infrastructure, or security engineer or architect looking for a configuration or design pattern to accomplish a particular task searches through the pool of existing systems for an implementation of a design pattern that appears to fit their needs. They copy the design pattern into their own system, make any adjustments as needed, and they’re done! They’ve leveraged an existing configuration to create a new configuration, relying on reuse to make sure the new component is configured as similarly as possible to the existing component.

Advantages: Copy/paste reuse reduces the time it takes to create new configurations by leveraging known, working configurations. In addition to reducing time to completion, this approach reduces errors in the initial implementation by reusing a working implementation.

Disadvantages: Once the configuration has been copied, it no longer tracks changes in the original configuration. The new configuration and the old configuration can drift apart over time, and eventually may no longer be similar. This tends to cause complexity to increase little by little as time passes.
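
A minimal way to catch this drift is to diff the two copies periodically. The sketch below uses Python’s difflib module; the file names are hypothetical.

    import difflib
    from pathlib import Path

    # Hypothetical file names: router-382.conf was copy/pasted from
    # router-123.conf months ago and has been edited independently since.
    original = Path("router-123.conf").read_text().splitlines(keepends=True)
    copied = Path("router-382.conf").read_text().splitlines(keepends=True)

    # Any output here is drift that copy/paste reuse will never reconcile on its own.
    for line in difflib.unified_diff(original, copied,
                                     fromfile="router-123.conf",
                                     tofile="router-382.conf"):
        print(line, end="")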

Pattern 2: Layered/Template Reuse

With the layered reuse model, rather than copying a configuration from one component to another, you create a shared layer of configuration that is common between the two. This configuration template is shared by both components. If a change is later made to that template, it propagates down and is automatically included in all configurations based on the shared layer, keeping them consistent.

Advantages: This has all the advantages of the copy/paste reuse model but eliminates the risk of slow divergence and increasing complexity over time, keeping the different configurations in sync with each other.

Disadvantages: More work is required both to create the shared layers/templates and to set up the necessary automation to use them in any given configuration, including syncing the changes automatically. Additionally, layered reuse introduces a danger: since a shared template is used in many places, a change to it for a given purpose may have unknown and undesired side effects in other places. Change management, change versioning, version pinning, continuous integration/continuous delivery (CI/CD) tooling, and change reviews can help mitigate this issue and ultimately create an environment where there is substantially less complexity because there is significantly more commonality and reuse among similar configurations.
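
Here is a minimal sketch of the layered model in Python: a shared base template plus a small per-device overlay, merged when the configuration is generated. The merge function and the settings shown are illustrative assumptions, not any specific tool’s behavior.

    # Shared layer: settings common to every router of this class. The keys
    # and values are illustrative, not any particular vendor's format.
    BASE_TEMPLATE = {
        "mtu": 1500,
        "logging": {"host": "10.0.0.5", "level": "warning"},
        "ntp_servers": ["10.0.0.10", "10.0.0.11"],
    }

    def merge(base: dict, overlay: dict) -> dict:
        # Recursively merge an overlay onto a base layer; overlay values win.
        result = dict(base)
        for key, value in overlay.items():
            if isinstance(value, dict) and isinstance(result.get(key), dict):
                result[key] = merge(result[key], value)
            else:
                result[key] = value
        return result

    # Per-device overlay: only what genuinely differs from the shared layer.
    router_382 = merge(BASE_TEMPLATE, {"mtu": 9000, "logging": {"level": "info"}})
    print(router_382)
    # A later change to BASE_TEMPLATE propagates to every device built from it
    # the next time configurations are regenerated.

Note the design choice: the overlay holds only the genuine differences, so the shared layer remains the single place where common settings are defined.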

Example: Router Configuration Files

A good example of the power of centrally managed configuration and reuse is the management of network router configurations.

A network router is used to route traffic from one spot in a network to another. Which traffic it allows to pass through can vary because of built-in firewalls, security profiles, network shaping, and routing and redirection rules. Additionally, where traffic is routed may change based on the type of traffic and the desired destination. Each of these routers has a configuration that dictates what rules the device must follow in how it routes traffic. Managing these rules can be extremely complicated, especially in a large enterprise that may have hundreds, thousands, or even tens of thousands of routers, switches, and other networking infrastructure components, all working together to create a safe, functional, secure environment for applications to operate in.

When a network has only a few of these routers, they can be managed by simply updating the configuration on the routers themselves. Most low-end routers even have web-based setup pages that allow an engineer to update the router’s configuration on the fly; your home network probably has at least one networking device (perhaps a WiFi router) that offers such a configuration option.

Figure 4-7 shows an example of a router with an internal configuration. This router is configured via a web browser or API calls, and as you can see, network engineers in various locations can update the configuration as needed.

Figure 4-7. Multiple people using multiple methods to update the router’s configuration

When several people are working independently on the same configuration, the changes often collide or conflict with one another. One person might make a change that causes another person’s changes to fail. The result is an unstable router, which leads to an unstable network. Additionally, if that router breaks and has to be replaced, the replacement won’t have the same configuration, and hence all the knowledge and expertise that went into creating the configuration in the first place will be lost. Furthermore, in an enterprise setting this router won’t be alone in the network, and it will need to work with perhaps hundreds or thousands of other routers. Following this approach, each of these routers will have a distinct and independent configuration that has been hand-tailored by many different people. As time goes on, the configurations of all these devices become more and more customized, more and more specialized. Because they are each unique, the complexity of the system as a whole is very high—all because of how these routers’ configuration files are managed.

Now take a look at Figure 4-8. Here, the web page and API calls that were used to configure the router have been disabled, and configuration via these means is not allowed. Instead, a copy of the configuration is stored off-device, in some centralized location. Every engineer who needs to make changes to the router’s configuration makes their changes to that off-device copy. Once those changes have been approved, the off-device copy is pushed or deployed to the router in order to make them go live. The only allowed way to make a change to the router’s configuration is to modify the off-device copy, then deploy it to the device, and a history of these changes can be preserved.

This model has many advantages:

  • All changes people make are centralized and can be examined by all interested parties before they are deployed to the router. This reduces the likelihood that a change made by one engineer will have a negative or unforeseen effect on the changes made by another engineer.

  • Each change can be logged in a revision control system so that changes can be tracked. If a network problem occurs, the history of modifications can be easily reviewed to try to determine which change may have caused the problem. The change can even be rolled back if necessary and the router restored to a previously good state until the problem can be fully investigated.

  • Changes can go through a test and review cycle before they are deployed to the live production router. This may even include pushing to a staging router to verify the updated configuration works as expected before deploying it to the production network.

  • Since all configurations are in a central location, they are available for inspection and review. This encourages reuse, reducing overall complexity.

  • Pushing the changes to production requires some form of CI/CD pipeline. This means the process of using shared layers and templates can be automated easily.

  • If a router fails, it can be replaced with a new one and an up-to-date configuration file can be pushed to the replacement router instantly, immediately getting it into the same state as the old router. This simplifies hardware maintenance operations.

Figure 4-8. In a centralized configuration model, changes are made off-device and pushed to the device
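
The deployment flow just described can be sketched in a few lines of Python: validate the off-device copy, exercise it on a staging router, keep the outgoing version for rollback, and only then push to production. Every function and file name below is a hypothetical placeholder for your own tooling.

    from pathlib import Path

    def validate(config: str) -> bool:
        # Placeholder check; real tooling would lint the vendor's syntax
        # and enforce security policy.
        return "hostname" in config

    def deploy(device: str, config: str) -> None:
        # Hypothetical push over SSH or a vendor API.
        print(f"deployed {len(config)} bytes to {device}")

    def release(candidate_path: Path, deployed_path: Path,
                staging: str, production: str) -> None:
        candidate = candidate_path.read_text()
        if not validate(candidate):
            raise ValueError(f"{candidate_path} failed validation; nothing deployed")
        deploy(staging, candidate)  # verify behavior on a staging router first
        # ... run network tests against the staging device here ...
        if deployed_path.exists():  # keep the outgoing version for rollback
            deployed_path.rename(deployed_path.with_suffix(".rollback"))
        deployed_path.write_text(candidate)  # record what production now runs
        deploy(production, candidate)

    # Example (assumes router-123.candidate exists):
    # release(Path("router-123.candidate"), Path("router-123.deployed"),
    #         staging="router-staging", production="router-123")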

Using centrally managed configurations reduces the overall complexity in the system, improving reliability, accountability, network availability, and, ultimately, application and company success.

Summary

Simplifying knowledge requirements and knowledge management is an important part of reducing system complexity and cognitive load. Maintaining centralized configurations is one strategy to improve knowledge management; in addition to all the benefits just described, it boosts confidence when making changes and limits the business risk involved in managing your application.

But knowledge management is about much more than centralizing configuration files. It’s about providing methods to reduce the amount and diversity of knowledge about the system that is required to maintain and operate it, and about promoting reuse, simplification, and standardization, without jeopardizing the value of moving quickly and encouraging innovation.

Effectively managing knowledge requires finding the right balance between agility and uniformity, speed and completeness, complexity and understandability. Ultimately, knowledge management is about balancing short-term agility and long-term reliability in a complex system.
