Chapter 1. What Is Network Automation?
Drivers and Trends in Network Automation
Today’s computers, from modern mainframes and hyperconverged infrastructure to smartphones and Internet of Things (IoT) devices, would be almost useless without networks to connect them. Many network components—from routers, switches, and access points to the software that implements virtual private networks (VPNs) and software-defined wide area network (SD-WAN) topologies—are complex mechanisms that, at least from the user’s perspective, should “just work” all the time, no matter how loads and user needs change. Theoretically, the best possible performance of a network could be obtained by manually optimizing the configuration of every network element, taking full account of all the components, users, and application usage, along with changes in each of these over time. While this would lead to a perfectly optimized network, working this way would consume too much time, require too much expertise, and likely still produce enough errors to cause network downtime; after all, there is always room for human error.
Relying on manual network optimization becomes unsustainable in the long run for any organization with more than a few users or devices under management. In all but the simplest and most stable networks, almost every hour spent on low-level, manual configuration or maintenance of network devices represents time and money that could, and really should, be spent somewhere else.
The obvious solution to the problems of manual network optimization is network automation, which can occur in several areas. Throughout this report, we will explore these areas of automation available to IT teams and discuss a path toward automating various tasks that today are done manually. If we think of each area of tasks that we can automate as a distinct “bucket of automation,” we can then measure the degree to which each bucket is automated. It is important to recognize that some organizations will never need to reach the maximum degree of automation in every area. Others may find that different departments or different network segments experience different degrees of automation as time goes on.
To define the degrees of automation, let’s look at a quick example: a user needing VPN access to a new network for a specific project. We can first look at the simplest and oldest type of automation: reducing the manual effort of adding a user to the VPN by replacing typed commands in a command-line interface (CLI) with single clicks in a graphical interface (Figure 1-1). Although some manual work is still required to perform this routine task, the task has become much simpler and, in a sense, some of the work has been automated. The next step is automating the task initiation. Rather than requiring a person from the IT team to click in a graphical interface, the action is initiated automatically based on certain conditions or actions. To a certain extent, this is the “outsourcing” of some operations. In our example of a user needing VPN access for a new project, the user could access a self-serve portal to initiate creating the new VPN directly from their device, rather than submit a ticket to IT and wait for the IT team to complete those few clicks in the graphical interface. In this example, automation nirvana, or the ultimate degree of automation, is where the VPN is automatically configured, enabled, and connected with no human intervention. This could take the form of “just-in-time” provisioning as the user attempts to access the VPN they need. In this example, we’ve defined three degrees of automation, and although the number of degrees and exact steps for automating this task will vary, we can see that each degree takes one step toward a fully automated task.
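To make the three degrees concrete, here is a minimal, self-contained Python sketch. Every name in it is a hypothetical stand-in rather than a real VPN product’s API; it only illustrates how each degree builds on the one below it.

```python
# A self-contained sketch of the three degrees of automation in the VPN
# example. Every name here is a hypothetical stand-in, not a real VPN API.

AUTHORIZED = {("alice", "project-x")}            # assumed entitlement data
PROFILES: dict[tuple[str, str], str] = {}        # provisioned VPN profiles

def create_vpn_profile(user: str, project: str) -> str:
    """Degree 1: the low-level step that once meant typing CLI commands."""
    profile = f"vpn-{project}-{user}"
    PROFILES[(user, project)] = profile
    return profile

def self_serve_request(user: str, project: str) -> str:
    """Degree 2: the user initiates the task from a portal; no IT ticket."""
    if (user, project) not in AUTHORIZED:
        raise PermissionError(f"{user} is not assigned to {project}")
    return create_vpn_profile(user, project)

def on_access_attempt(user: str, project: str) -> str:
    """Degree 3: just-in-time provisioning triggered by the access attempt."""
    if (user, project) not in PROFILES:
        self_serve_request(user, project)        # provision on first use
    return PROFILES[(user, project)]

print(on_access_attempt("alice", "project-x"))   # -> vpn-project-x-alice
```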
Although our example looked at a user-initiated request, a parallel trend has been to automate more of the manual maintenance tasks that the staff in charge of configuring and maintaining a network routinely complete. The natural evolution has been automatically executing some of the simplest operations that occur on a daily basis, like resetting remote machines or reconfiguring some of their parameters from a central dashboard.
In recent years, the arrival of cloud computing, whether inside company-owned data centers as private clouds or from external providers as public clouds, has made it possible for many network staff to provide not just single services but entire infrastructures and software-defined networks on demand, in real time. This is accomplished in two main ways. One is automation, which is sometimes simply the integration of most of the underlying sequential operations that previously happened manually. The second is outsourcing, as the automation capabilities provided through software-defined networks increase the number and variety of operations that are safe to outsource. The result is higher user satisfaction, much better utilization of the skills and time of network staff, and much less risk.
Today, the digital transformation continuously taking place across businesses of all sizes increases the pressure on networks in ways that call not just for even more automation but also for deep changes in its nature. Among the main factors contributing to this trend are the accelerated adoption of ecommerce, video conferencing, and remote work in general. Some organizations may add thousands of devices to their networks in a very short time by adopting IoT technologies, for example, to track company fleets or the movement of goods through their supply chains.
All of these changes are opportunities for business transformation, discussed in more detail in Chapter 2, but they also make networks much more complex and dynamic, introducing problems and needs almost unheard of before. Even when they are not huge in size, today’s networks can quickly become so heterogeneous and so variable in their loads and conditions of use that they resemble unknowable black holes: places where it is extremely hard to know what the main problems really are, or even to detect that there are problems at all!
These facts shine a light on two areas of network management relevant to network automation. One is making visible what was hidden, through much better automated discovery and data analysis, so that it is easier to see which actions or decisions should be taken. The other is performing at least some of those actions automatically. At the furthest degree of automation, this means making networks self-healing: able to reconfigure themselves when faults happen, new subnetworks are added, or usage conditions change.
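The two areas combine naturally into a monitor-and-remediate loop. The sketch below is a hedged illustration of that pattern; poll_device_health() and apply_remediation() are hypothetical stand-ins for whatever telemetry and configuration interfaces a real network exposes.

```python
# Simulated health data; a real system would query SNMP counters,
# streaming telemetry, syslog, and so on.
SIMULATED_FAULTS = {"edge-sw-02": "link-down", "core-rt-01": "unknown-flap"}

KNOWN_FIXES = {"link-down": "fail over to backup uplink",
               "high-latency": "reroute traffic"}

def poll_device_health(device: str) -> str | None:
    """Stand-in for automated discovery and analysis; returns a fault name."""
    return SIMULATED_FAULTS.get(device)

def apply_remediation(device: str, action: str) -> None:
    """Stand-in for a configuration push to the device."""
    print(f"{device}: applying remediation: {action}")

def self_healing_pass(devices: list[str]) -> None:
    for device in devices:
        fault = poll_device_health(device)
        if fault is None:
            continue                                   # visibility: nothing wrong
        action = KNOWN_FIXES.get(fault)
        if action:
            apply_remediation(device, action)          # highest degree: act automatically
        else:
            print(f"{device}: {fault} needs human review")  # lower degree: just report

self_healing_pass(["edge-sw-01", "edge-sw-02", "core-rt-01"])
```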
This short overview shows how network automation makes any organization with a network (which, if we’re being honest, is every organization these days!) much more efficient. It is now time to examine the main areas in which this automation should happen inside a network, one at a time.
Network Design
Network automation starts with the design of a network. A good network is, first of all, one whose composition, topology, and configuration (its design) are always completely known. To achieve this goal, we may think first of automating the design of a new network itself, but we can also start by automating the representation of an already existing one through automated network documentation. With new networks, design automation leaves the network designers responsible for describing the desired results: how many network segments they want, how the segments are connected, what their connection with the internet is, and how they are protected (for example, allowing only email or web access), as well as the minimum bandwidth to reserve for video conferencing services and so on. With such information, the automation system can then take care of all the low-level details—for example, map out how many switches or routers should be deployed and describe how to connect and configure them. For existing networks, design automation entails certifying or auditing the actual topology and producing a similar map by probing all the devices active on the network, extracting their configuration parameters, and inferring the topology and other information from them.
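As a rough illustration of the new-network case, the sketch below turns a small, hypothetical design specification into a low-level sizing decision. The spec format and the 48-ports-per-switch rule are assumptions made for the example, not a real design tool’s input format.

```python
# A hypothetical illustration of design automation: the designer states the
# desired results, and tooling derives low-level details. The spec format and
# the 48-ports-per-switch sizing rule are assumptions made for this sketch.

DESIGN_SPEC = {
    "segments": [
        {"name": "office", "hosts": 120, "internet": True,  "allow": ["web", "email"]},
        {"name": "lab",    "hosts": 30,  "internet": False, "allow": []},
    ],
    "reserved_video_bandwidth_mbps": 50,
}

def plan_switches(spec: dict, ports_per_switch: int = 48) -> dict[str, int]:
    """Map each segment's host count to a number of access switches."""
    return {
        seg["name"]: -(-seg["hosts"] // ports_per_switch)   # ceiling division
        for seg in spec["segments"]
    }

print(plan_switches(DESIGN_SPEC))   # -> {'office': 3, 'lab': 1}
```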
When a network’s entire actual structure and composition are completely known, every part of the network becomes easy to upgrade, replace, or restore after faults, at the smallest overall cost. As obvious as this thesis may seem, in the real world, “completely known” is exactly where the problems start. System administration folklore abounds with stories of ghost servers or switches, long forgotten in some basement but still running and connected to the internet—that is, equipment that hosts content that should not be there, runs software that begs to be attacked, or, in the best possible case, wastes bandwidth and electricity for no reason at all. Even when no ghost devices are present, company reorganizations, acquisitions, or relocations to new facilities can cause unpleasant surprises, creating designs like those depicted in Figure 1-2. In all such circumstances, if the network’s inventories and maps do not match reality, administrators will consume precious time just to be sure of what they should do first.
Beyond not knowing that a device exists or where it sits in the network, it can be equally easy to forget how some devices are actually configured and why those configuration choices were made. When network requirements change and static legacy network device configurations remain in place, bandwidth bottlenecks and other performance degradation (read: unnecessary costs and unnecessarily poor user experience) may remain hidden for years. Moving content and services to the cloud, either public or private, may increase the likelihood of such events.
This lack of visibility and inefficient network design should never happen in a good network. The basic methods and approaches to proper network design and visibility are well known, in principle, and are all based on open technologies. There are many ways to collect the make, model, and serial number of every device on the network, along with its virtual local area network (VLAN) and IP address assignments, Address Resolution Protocol (ARP) tables, and all other details. Standard protocols and procedures, from wire tracing and port mapping to Cisco Discovery Protocol (CDP) and Link Layer Discovery Protocol (LLDP), can gather all the data that a technician needs to infer and draw a full diagram of the whole network and understand the entire network design. The point is, neither the collection of that raw data nor its formatting and presentation should ever be done, or kept current, manually. Doing so would surely cost more in staff time than adopting solutions already tested in many other organizations, with no guarantee against errors. The same is true, in almost all cases, for carefully crafted in-house custom scripts and tools, which invariably end up consuming much more time in maintenance than originally expected.
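As a small taste of what automated collection looks like, here is a sketch that pulls LLDP neighbor data from a single device. It assumes the open source netmiko library, a Cisco IOS device with LLDP enabled, and placeholder address and credentials; a real discovery tool would crawl every neighbor recursively and parse the raw text into an inventory and diagram.

```python
# Automated neighbor discovery with the open source netmiko library,
# against a Cisco IOS device with LLDP enabled. Address and credentials
# below are placeholders.
from netmiko import ConnectHandler

def collect_lldp_neighbors(host: str, username: str, password: str) -> str:
    conn = ConnectHandler(
        device_type="cisco_ios",
        host=host,
        username=username,
        password=password,
    )
    try:
        # Returns the device's raw LLDP neighbor table as text.
        return conn.send_command("show lldp neighbors detail")
    finally:
        conn.disconnect()

if __name__ == "__main__":
    print(collect_lldp_neighbors("192.0.2.1", "admin", "secret"))
```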
Asset inventories and everything else that is necessary to have complete network visibility and control in real time should be a given, not something that requires constant intervention or dedicated manual efforts. Maps that can show the exact design of a network, with detailed visibility into the location or switchport connections of every device, should constantly update by themselves, as soon as the network changes. Even higher-level operations—for example, partitioning a network in semi-independent zones that can be independently managed or updated one at a time, without affecting all the others—should happen with as little manual work as possible, following consistent but automatic procedures.
Network Configuration: Policies
A perfectly mapped network is still an ugly place without rules on how to use it and adapt it to its users’ needs: rules that all interested parties can set and follow without ambiguity. The most common, but by no means the only, examples of such rules and procedures are those used to define bandwidth caps, access-control lists (ACLs), user quotas, password policies, and firewall rules. As we think about automating these rules and procedures, we can think about automation in terms of both policy definition and policy enforcement. Automating policy enforcement is exactly what network devices like firewalls are designed for, so when we speak of automating network policy, it is very much about the policy definition or configuration. When putting together the description and enforcement of exactly how users must or can use assets, it’s important that those tasks be automated to consume as little IT staff time as possible. Besides reducing the daily load on network administrators, automating policy definition brings two other big advantages: consistency and (self-)documentation. If policy definition has been automated, all policies will follow the same format and structure and will look consistent to the readers of your network design and documentation. Along with this, the documentation of policy configuration and changes can be generated at the time of policy creation, so your network documentation is always up to date.
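The sketch below illustrates the idea: policies live as structured data, and the same renderer produces consistent device configuration every time. The rule schema and the rendered syntax are assumptions for this example, loosely modeled on common ACL styles rather than on any specific vendor’s CLI.

```python
# A sketch of automated policy definition: policies live as structured data,
# and consistent configuration (which doubles as documentation) is generated
# from them. The rule schema and rendered syntax are illustrative assumptions.

POLICY = [
    {"name": "allow-web",  "action": "permit", "proto": "tcp", "dst_port": 443},
    {"name": "allow-mail", "action": "permit", "proto": "tcp", "dst_port": 587},
    {"name": "deny-rest",  "action": "deny",   "proto": "ip",  "dst_port": None},
]

def render_acl(policy: list[dict], acl_name: str = "USER-EDGE") -> list[str]:
    """Render every rule in the same shape, so all policies look consistent."""
    lines = [f"ip access-list extended {acl_name}"]
    for rule in policy:
        port = f" eq {rule['dst_port']}" if rule["dst_port"] else ""
        lines.append(f" {rule['action']} {rule['proto']} any any{port}  ! {rule['name']}")
    return lines

print("\n".join(render_acl(POLICY)))
```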
Network Configuration: Provisioning
Clear, concretely enforceable rules on how to use the network are of little use if the network itself, or the services it makes accessible, are hard to change. With infrastructure as a service (IaaS), distributed teams, and work-from-anywhere becoming increasingly common, it becomes necessary to provide applications, services, and general connectivity to any combination of local and remote hardware and virtual platforms. To understand when, how, and why this provisioning could happen in practice, it is easiest to look at a software development scenario: consider the developers of some real-time collaboration software who need to reproduce a reported bug. Those developers would work much better if they could reproduce the issue exactly as the client sees it, in the exact same network where the bug was first noticed, and at minimum cost. They would need a virtual network to play in, maybe for just a few hours or days, but with virtual switches, virtual firewalls, and so on that both reproduce the desired conditions and keep that area completely isolated from the rest of the network. Other examples may be a company that needs to set up a product demo at a conference or a university that must run final exams in a temporary but tightly controlled network to avoid cheating. Both would have very similar needs and would benefit from streamlined, automated provisioning.
These are just a few examples of why, to keep up with the pace of business, adding users, LANs, VPNs, virtual switches or firewalls, and more must be possible in real time, in ways that are transparent to end users and, to some extent, also to the network staff. In a fully automated workflow, for every situation like those just described, the users should ideally be able to describe what they need to do and under which high-level conditions, without having to configure intricate technical details manually. For example, “emulate a running website with up to a hundred simultaneous users, each with at least X upstream bandwidth, but isolated from the real internet” describes a network to provision at a high level, without the need to include all the details. In other words, as far as provisioning is concerned, network automation must make it possible to perform and coordinate all these tasks always in the same way, from the same interface, regardless of where the software and physical devices involved are, and by describing the desired outcome—that is, the final state the network should be in—rather than which options should be set to get there.
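A hedged sketch of this outcome-oriented style follows: a high-level intent, much like the quoted example, is expanded into ordered low-level tasks. The intent fields and task names are illustrative assumptions, not the API of any real orchestrator.

```python
# A sketch of outcome-oriented provisioning: a high-level intent is expanded
# into ordered low-level tasks. Field and task names are illustrative only.
from dataclasses import dataclass

@dataclass
class NetworkIntent:
    purpose: str
    max_users: int
    min_upstream_mbps: int
    internet_isolated: bool

def expand_intent(intent: NetworkIntent) -> list[str]:
    """Translate the desired outcome into the steps that reach it."""
    tasks = [
        f"create isolated VLAN for '{intent.purpose}'",
        f"deploy virtual switch sized for {intent.max_users} hosts",
        f"set per-user upstream guarantee to {intent.min_upstream_mbps} Mbps",
    ]
    if intent.internet_isolated:
        tasks.append("attach virtual firewall: deny all traffic to and from the internet")
    return tasks

demo = NetworkIntent("bug reproduction website", max_users=100,
                     min_upstream_mbps=5, internet_isolated=True)
for task in expand_intent(demo):
    print(task)
```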
Life Cycle
Networks are most valuable when they are reliable, and a reliable network depends on managing the full life cycle—from initial deployment to end of life—of all of the underlying infrastructure that keeps the network up and running. To start, the predictable, regular updates of firmware and software are the simplest of several life cycle issues to consider. Company acquisitions or opening new remote offices are much more complex, but they are likely to happen—in most cases, at least—with enough notice to allow proper planning of how the network should be expanded or redesigned.
A number of less predictable updates occur throughout a device’s life cycle as well. Take identified vulnerabilities and the subsequent security patches as an example. This case is well positioned for automation, as security advisories are released without notice and often need near-immediate reaction, leaving little time to plan manual activities. Let’s look at security patching as another example of progressive automation. As a first step, a properly automated network should spot and report automatically every security advisory or software update that affects any of its devices as soon as it is announced. This is an incremental process, as shown in Figure 1-3, whose scope grows as the maturity of the life cycle automation increases. Similar notifications and reports should be issued for ordinary new releases of firmware or software, indicating which specific devices should be updated but at first leaving administrators the responsibility to push those updates manually. As the degree to which security patching is automated increases, these manual updates become automated: first by enabling the IT administrator to simply initiate the process and confirm the result, and eventually without any human intervention or oversight at all. It must be stressed that all of this monitoring should happen regularly, by itself: effective, real-world automation is not a series of fire-and-forget actions but a self-sustaining, incremental process.

Even nontechnical notifications, like the approaching expiration of support contracts or the mandatory phaseout of some product, should be issued and reported by the network automation system, in one place and one coherent format, to give full visibility of what lies ahead. Ideally, network managers should always have available, at any moment, the exact, complete answer to questions like: if one of my devices fails, am I able to replace it with a similar device, or are those devices no longer available for purchase? For its part, the network automation system should contribute to the answer by being able to list, in addition to all the parameters mentioned previously, the exact capabilities of each device.
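The first step, spotting and reporting, can be as simple as matching an advisory feed against the device inventory. The sketch below uses toy data structures; real feeds (a vendor PSIRT feed, the NVD) and real inventories are far richer, and real tooling would parse versions properly instead of comparing strings.

```python
# A sketch of the first degree of life-cycle automation: match an advisory
# feed against the device inventory and report, leaving the patch push to
# humans. Both data structures are toy assumptions.

INVENTORY = [
    {"host": "edge-sw-01", "platform": "exampleOS", "version": "15.2"},
    {"host": "core-rt-01", "platform": "exampleOS", "version": "16.1"},
]

ADVISORIES = [
    {"id": "ADV-2024-001", "platform": "exampleOS", "fixed_in": "16.0"},
]

def affected_devices(inventory: list[dict], advisories: list[dict]) -> list[str]:
    """Spot and report every device an advisory applies to."""
    hits = []
    for adv in advisories:
        for dev in inventory:
            # Naive string comparison for brevity; real tooling parses versions.
            if dev["platform"] == adv["platform"] and dev["version"] < adv["fixed_in"]:
                hits.append(f"{dev['host']}: {adv['id']} (fixed in {adv['fixed_in']})")
    return hits

for line in affected_devices(INVENTORY, ADVISORIES):
    print(line)   # -> edge-sw-01: ADV-2024-001 (fixed in 16.0)
```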
Life-cycle automation extends naturally to compliance, where requirements change continuously as new regulations for privacy, data protection, employee safety, and financial transparency arrive. The General Data Protection Regulation (GDPR), the Sarbanes–Oxley (SOX) Act, and the Health Insurance Portability and Accountability Act (HIPAA) are only three of the many regulations that put concrete obligations on company networks in the United States, the European Union, and beyond.
While we often think of these frameworks as putting obligations on data, it is worth noting that they have an impact on networks as well. They routinely mandate what a network must guarantee (for example, uptime) or prevent in order to reduce risk, and also how to present the corresponding data about the network—for example, through reports in the Information Technology Infrastructure Library (ITIL) standard format.
These reports should not be prepared only when an audit is coming. They should always be there, ready for external or internal audits, courtesy of the network automation services. The same services should also work continuously to maintain compliance, refusing, or at least warning against, any change to the network’s configuration that would break compliance with some regulation.
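A continuous-compliance gate can be sketched as a pre-change check that refuses or warns. The two rules below are toy assumptions; encoding real GDPR, SOX, or HIPAA controls takes far more detail, but the shape of the check stays the same.

```python
# A sketch of continuous compliance enforcement: a proposed change is checked
# against simple rules before it is applied. The rules are toy assumptions.

def check_change(proposed_config: dict) -> list[str]:
    """Return a list of violations; an empty list means the change may proceed."""
    violations = []
    if not proposed_config.get("logging_enabled", False):
        violations.append("audit logging must stay enabled")
    if proposed_config.get("telnet_enabled", False):
        violations.append("unencrypted management access is not allowed")
    return violations

change = {"logging_enabled": True, "telnet_enabled": True}
problems = check_change(change)
if problems:
    print("Change refused:")
    for item in problems:
        print(" -", item)
else:
    print("Change allowed")
```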