Chapter 4. AI and Analytics in Edge Systems
Edge computing is about much more than just the edge.
Done well, edge computing is a finely choreographed combination of large-scale data, analytics, and AI, often happening rapidly, spread across many locations, and all communicating back and forth with a centralized core. Edge data can arrive in stunningly large amounts, and dealing with it at each edge location, and collectively at the core, requires a sophisticated, scale-efficient system if it is to be successful.
Large-scale industrial IoT use cases are what generally come to mind first when people think of edge computing, but while these are classic examples, edge use cases extend to many sectors. The fact is, the business doesn’t happen just in a data center; many organizations must deal with data, analytics, and/or AI out in the real world, where transactions take place. If you work in retail, the financial sector, or web-based services, you may be surprised to know that the challenges your organization faces, and the solutions they require, share many similarities with those of big IoT edge systems.
With edge situations, just getting data usually isn’t sufficient. Action needs to be taken at the edge as well. Insights needed to direct that action are based on a global view gained by analysis or modeling of data from many edge locations. The idea is that by positioning computers near the action, yet having them as a group interact with a core data center, you can measure and act directly wherever needed.
That much is simple. At this point, however, details begin to matter, and the particular goals of edge computing in different situations and different businesses may look very different. But the important and shared challenges they face are, happily, addressed by the same solutions that underlie scale-efficient AI and analytics systems in general.
Note
The fundamental approaches we have described that enable scale-efficient systems also apply to meeting the challenges of edge systems.
Not only do the same general principles that we have outlined in this book about AI and analytics systems at scale still apply at the edge, in many ways edge amplifies the need for them. Choosing designs appropriate for scale efficiency, and choosing data infrastructure to support these designs, frees you from the unfortunate trade-offs we described in Chapter 1. There are, however, some extensions to these principles required for the specialized situations that arise with production edge computing. For example, many edge locations require highly reliable, highly performant, cost-effective hardware with a small footprint.
We have worked with companies with very different kinds of edge systems, but in spite of the differences, most of them share common challenges that include the need to:
- Capture data at the edge
- Use analytics and AI for data reduction
- Use analytics and AI to take action at the edge
- Move data back to the core from the edge
- Move analytics applications and AI models out to the edge
- Use analytics and AI to gain global insights
- Manage fleets of edge systems
- Maintain design freedom
- Do all this while maintaining security
Some of the systems also had additional challenges such as massive amounts of data, intermittent network connections, or the need for regional peer data-sharing. But all of these challenges can be substantially simplified by the same approaches of good design and appropriate data infrastructure.
Before we go further into the challenges and the scale-efficient solutions used to address them, let’s dig a bit deeper into some background about what edge actually is.
Edge Means Many Things
Edge computing really just means computing that occurs outside a data center, which is happening a lot lately. Traditionally, edge computing was limited to a few industrial contexts, largely due to cost and complexity issues with available technology. Large industrial plants such as factories, refineries, chemical plants, and oil exploration and extraction facilities were the primary situations in which edge computing could provide a viable return on investment. Even so, the role of edge computing was typically limited to a few key digital control and logging functions. Bringing data back to the core usually involved a truck, helicopter, or dial-up modem rather than the internet.
Things have changed. The formerly very limited scope of edge computing has exploded. Industrial applications are still important, but the scale of data they measure and the sophistication of the computation done at the edge have grown dramatically. In addition, the amount of data transferred to and from core computing facilities has increased by six to ten orders of magnitude.
Other kinds of edge computing have proliferated as well. The industrial edge still exists in dramatically expanded form, but distributed data acquisition applications have become quite important. These systems often acquire data not only from localized installations, such as factories, but also from distributed physical systems, such as pipelines, electrical grids, or office buildings.
Another category could be called consumer edge. It involves very small computers that measure and send small amounts of information (typically at most 1–2 kB/second) via mobile internet connections to websites. These applications can be found in fitness monitors, phones, and power meters; in connected cars; and even in consumer electronics like doorbells, thermostats, or home weather systems. In these systems, the number of edge computers can be in the hundreds of millions.
A fourth category of edge systems includes telemetry collection and alerting from digital systems such as content delivery systems, telecommunications networks, warehouses, and retail stores. These systems are similar to distributed industrial edge systems, but they often require more sophisticated computation to be done in the edge location.
A final category of edge computing includes very large-scale data collection systems (e.g., those employed by autonomous car development efforts) or some scientific systems (e.g., radio telescopes). These systems can return tens of petabytes per day and often require very sophisticated processing to reduce the data to this still gargantuan level before extracts and summaries are forwarded to central facilities. These systems often have very advanced edge data systems just to maintain the metadata about which data has been collected and extracted.
These categories are not necessarily either comprehensive or even clearly delineated. Some systems exhibit properties of more than one category. For instance, a consumer edge system probably has a distributed set of web servers that require telemetry collection. Similarly, some consumer edge systems might collect data that is merged with distributed industrial data to provide hybridized data sets. These categories are useful, however, in understanding the production and scaling requirements of a system, and in predicting what kinds of issues will arise.
Edge systems in each of these categories can be characterized in terms of several factors:
- The data sources—How many sources? Will the number grow? How much data per source? How long is the data retained? How much is sent to the core? What latency and data loss are acceptable?
- Geographical distribution—Are sources grouped? Are they spread widely, possibly internationally?
- Data ownership—Who owns the data? Who can use it? Are there limits on international data motion? How can data be mashed up?
- Data volume and workload—How intensive is the edge computation?
- Type of hardware at the edge—What kind of machines are at the edges? Are they nanopower sensors powered parasitically by light and vibration? Are they Raspberry Pi-sized devices? Or are these edge systems composed of several racks of computers?
- Security—How do we know if the edge has been hacked, or data has been corrupted? How do we prevent it?
- Management—Are the edge nodes a fully controlled fleet or a consumer app?
Most of these factors, such as the number of data sources, can be straightforwardly understood. Some, however, are more subtle in their effect and can be harder to understand. In subsequent sections of this chapter, we explore geo-distribution, high-volume data ingestion, security and ownership, and, finally, management of edge devices.
Geo-distribution
It is almost a given that an edge-based system will involve some significant amount of geographical distribution. That is one of the biggest reasons that having a unified data layer with efficient platform-level capabilities for data motion is essential for scale-efficient AI and analytics, at edge locations and at the core. The same data infrastructure that works for the centralized data center should be able to stretch to edge locations, as you’ll see in the real-world customer use cases we describe in this chapter.
How that affects the system can vary, depending on your needs and circumstances. With the consumer edge, each edge is nearly independent, and so data can be buffered and reported whenever a connection is available. In some other cases, it is the aggregate of all data that is important, so it is often reasonable to drop data if a connection isn’t available. Credit card transaction terminals, for example, can buffer, and the traffic reporting based on mobile phone location data (as found in Google Maps or in applications like Waze) can afford to lose a fair bit of data. These systems can usually move data with simple mechanisms like HTTP to a scalable web service that stores the data into a core system. The fact that this scalable web service is likely to be geo-distributed often means that it can be much simpler to design the service if you have a geo-distributed data infrastructure to match.
Other systems may have higher volumes of data, or the data may be more critical. In such cases, it is important that the edge systems do a good job of buffering large amounts of data until a link is available. Once you reach this level of complexity, it begins to be important to support data motion at an infrastructural level so that applications are not burdened with the considerable complexity of providing high-volume transfers over unreliable links.
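To make this concrete, here is a minimal sketch of the buffer-and-forward pattern for a consumer-edge device, written in Python. The ingest endpoint, payload format, and use of plain HTTP(S) via the requests library are illustrative assumptions; as noted above, higher-volume or more critical systems should push this responsibility down into the data infrastructure rather than reimplementing it in every application.

```python
# Illustrative only: a minimal buffer-and-forward loop for a consumer-edge device.
# The endpoint URL and payload format are hypothetical.
import json
import time
from collections import deque

import requests

INGEST_URL = "https://ingest.example.com/v1/readings"   # hypothetical ingest endpoint
buffer = deque(maxlen=10_000)   # bounded buffer: oldest readings dropped if offline too long


def record(reading: dict) -> None:
    """Queue a reading locally; nothing is lost just because the link is down."""
    buffer.append(reading)


def flush() -> None:
    """Attempt to drain the buffer; on any network error, keep data for the next try."""
    while buffer:
        reading = buffer[0]
        try:
            resp = requests.post(
                INGEST_URL,
                data=json.dumps(reading),
                headers={"Content-Type": "application/json"},
                timeout=5,
            )
            resp.raise_for_status()
            buffer.popleft()          # only discard after a confirmed send
        except requests.RequestException:
            break                     # link unavailable; retry on the next flush cycle


if __name__ == "__main__":
    while True:
        record({"ts": time.time(), "temp_c": 21.5})   # stand-in for a real sensor read
        flush()
        time.sleep(10)
```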
Real-World Example: Telemetry Backhaul
We worked with a company that does media streaming for mobile telecommunication providers. They had about a hundred regional content service systems around the world and needed to understand how these systems were working. Their regional systems wrote records to log files that were then transferred back to a core control and monitoring system. Unfortunately, the complexity of the system for reliably transferring these log records had spiraled.
We helped them build a system based on a data fabric to eliminate that complexity. In this system, each regional center had a small data fabric cluster that hosted a message stream for each class of log information. Each of these regional message streams was replicated back to a single unified message stream in the core cluster at the fabric level. Moving the responsibility for data motion to the data fabric vastly simplified the application side of the problem. The data fabric already had these capabilities built in, so there was no incremental work required to move data other than configuring the replication.
This use of the data fabric to move telemetry data had a much larger impact on overall system development than we expected, because fabric-level data motion made it much easier for developers to focus. On the ingest side, all that the edge developer had to do was get messages written into a single stream; the edge developer could then ignore how and where the messages might go. On the analysis side, in the core data center, the analytics developer only had to look at a single stream to see all the data from all regional centers and could ignore how these messages got there. The administrator of the system could focus only on the pattern of data motion and ignore all aspects of what data was being moved. In retrospect, the impact of letting people focus on simple tasks made sense, but at the time it was surprisingly large.
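The following sketch illustrates that division of labor. It uses the kafka-python client as a stand-in for the data fabric’s message streams, and the stream names and broker addresses are hypothetical; in the actual system, replication from the regional streams to the unified core stream was configured at the fabric level rather than coded by hand.

```python
# Illustrative sketch: the edge writes one local stream, the core reads one unified
# stream, and replication between them is configuration, not application code.
# kafka-python stands in here for the data fabric's stream API; names are hypothetical.
import json

from kafka import KafkaProducer, KafkaConsumer

# --- Edge side: write log records into the regional stream and stop there. ---
producer = KafkaProducer(
    bootstrap_servers="regional-fabric:9092",           # local regional cluster
    value_serializer=lambda v: json.dumps(v).encode(),
)
producer.send("service-logs", {"host": "cdn-042", "level": "WARN", "msg": "cache miss spike"})
producer.flush()

# --- Core side: read the single replicated stream that merges all regions. ---
consumer = KafkaConsumer(
    "service-logs-all-regions",                          # hypothetical unified stream
    bootstrap_servers="core-fabric:9092",
    value_deserializer=lambda v: json.loads(v.decode()),
)
for record in consumer:
    print(record.value)   # analytics code sees every region without knowing how data arrived
```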
High-Volume Ingest
A key challenge of many edge systems is ingesting and processing enormous quantities of data. This is especially true in IoT industrial edge enterprises such as telecommunications, advanced manufacturing, transportation, and oil and gas exploration. Here’s where having a scale-efficient system built on the fundamentals we described in Chapter 2 really pays off.
At the highest volumes of incoming data, it is often critical that the data be culled or compressed before transmission. Even then, the amount of data to be transferred may still be very large. In such cases, depending on a solid data infrastructure is critical to reliable function because few application developers have the background needed for developing high-volume backhaul systems.
In any case, it is important when moving large amounts of data to carry along application-specific metadata about what is being transferred. At high volumes, the data moved is typically in files, but the metadata describing those files is typically more tabular in nature.
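As a purely illustrative example, the following Python sketch shows both ideas at a small scale: readings are culled by a simple score threshold and compressed before transfer, and the resulting payload is described by a small tabular metadata record. The field names, fixed threshold, and scoring are assumptions; real systems typically use trained models, not a hardcoded cutoff, to decide what is interesting.

```python
# Illustrative only: cull and compress at the edge, then describe the result with
# a small, tabular metadata record that travels with (or ahead of) the bulk data.
# Field names, the fixed threshold, and the scoring are hypothetical.
import gzip
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class TransferRecord:
    source_site: str        # which edge location produced the payload
    file_path: str          # where the payload lands in the shared namespace
    size_bytes: int
    sha256: str             # integrity check for the transferred payload
    collected_at: str       # ISO timestamp of capture at the edge
    tags: str               # e.g., "anomalous,pump-7" to guide downstream selection


def reduce_and_describe(readings, site, path, threshold=0.95):
    """Keep only high-scoring readings, compress them, and return (payload, metadata)."""
    interesting = [r for r in readings if r["anomaly_score"] >= threshold]
    payload = gzip.compress(json.dumps(interesting).encode("utf-8"))
    meta = TransferRecord(
        source_site=site,
        file_path=path,
        size_bytes=len(payload),
        sha256=hashlib.sha256(payload).hexdigest(),
        collected_at=datetime.now(timezone.utc).isoformat(),
        tags="anomalous",
    )
    return payload, asdict(meta)


readings = [{"sensor": "pump-7", "anomaly_score": 0.12},
            {"sensor": "pump-7", "anomaly_score": 0.98}]
payload, meta = reduce_and_describe(readings, "refinery-east",
                                    "/edge/refinery-east/batch-0001.json.gz")
print(meta)   # the tabular record the core can query without touching the bulk file
```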
Real-World Example: Autonomous Car Development
When building cars that can drive themselves in the real world, you need real-world data to train the models that do the driving, and you need a lot of it. We have worked with several major automotive companies that are working on this problem to help design a data infrastructure that helps collect data from cars on multiple continents and bring it back to a very large machine learning cluster. The basic architecture is shown in Figure 4-1.
In this system, each field data station consists of a small data fabric cluster that is linked to a very large core cluster. Cars go out on hours-long test drives and gather data at a rate of 1–5 GB/second. When they return, this data is transferred to the field data station, where it is analyzed and interesting or anomalous parts are extracted. Metadata describing the extracted data is synchronized with the core cluster. This metadata guides the core processes that prepare the data for the machine learning process that builds the models that ultimately drive the cars.
This system handles enormous amounts of data. The field stations collectively ingest many tens of petabytes per day and transfer roughly 5 PB of extracted data to the core cluster, which retains about 500 PB of data or more. All data transfer is fully automated and implemented in the data fabric.
Other edge use cases involve very high data rates, much as with autonomous vehicle data acquisition; these include, among others, telecom systems and battlefield sensor systems. In many of these cases, the edge data rates are so high, or the links back to the core so poor, that very little actual data can be sent back. In such situations, it is common that nothing but metadata can be transferred. That is only viable, of course, if the edge units are completely autonomous in the sense that they can act on the data that they collect and need to return only very small fractions of it to the core.
Security and Ownership
Security in edge systems is considerably more complex than with conventional data centers. For starters, concepts like perimeter defense make no sense when there is no real perimeter. Another key problem is that, especially in the industrial edge, there are often complex but very strict limits about who is allowed to see which data. A good example of this arises in systems with hosted fleet management services for edge devices. The operator of the management services needs to see operational details about the devices, but usually must not be able to see data from applications running on or using the managed devices.
It is important that these issues of ownership and visibility be expressed at a platform level and not be left to application designers to enforce. This is true in part because applications are usually less trustworthy than a data fabric that has been hardened through use with hundreds of customers, but also because it is important for permissions to be externally inspectable by auditors or administrators. Encoding custom access controls in applications doesn’t allow such inspection.
Another way that security differs in the edge world is that for any data with significant value, there is an issue of the validity and integrity of the edge software and the hardware running that software. Typically, this requires a trusted platform module (TPM) and trust mechanisms that first validate a small core of the system and then use that core to expand to larger and larger trust regions. There is, at the time we are writing this, no widely accepted mechanism to extend this hardware root of trust to software workloads, but the widely supported open source SPIFFE Runtime Environment (SPIRE) project is building a mechanism to use the hardware root of trust to anchor a SPIFFE ID that would serve this purpose.
Another, less satisfactory, mechanism would be to embed a client certificate signed by a special-purpose signing authority into the software running on the edge. Such a certificate generally has to be long-lived, however, making management more difficult and possibly less secure. Despite their limitations, device certificates do provide a reasonable minimum level of protection against an attacker being able to forge devices or intercept their communications.
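As a minimal illustration of that device-certificate approach, the Python sketch below shows a device authenticating to a central management service with mutual TLS via the requests library. The file paths and management URL are hypothetical, and this illustrates only the minimum level of protection just discussed, not a TPM- or SPIRE-based flow.

```python
# Illustrative only: a device authenticating to the core with an embedded client
# certificate (mutual TLS). Paths and the management URL are hypothetical.
import requests

resp = requests.get(
    "https://mgmt.example.com/api/v1/checkin",      # hypothetical management endpoint
    cert=("/etc/edge/device-cert.pem",              # long-lived device certificate
          "/etc/edge/device-key.pem"),              # private key provisioned at manufacture
    verify="/etc/edge/core-ca.pem",                 # trust only the core's signing authority
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```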
Edge Management
Throughout this report, we’ve described the importance of leveraging management efforts of system administrators and IT teams through efficient separation of concerns. This is supported by handling much of the management and data logistics through platform-level capabilities of the data infrastructure. That really is put to the test with edge systems where data infrastructure must also be highly reliable and preferably self-healing to deal with many locations. Similarly, you need a software framework for efficient orchestration of computation (think Kubernetes, for instance) plus edge hardware that is extremely reliable.
The management of edge systems is considerably harder than it first appears, particularly when you have more than about a dozen edge nodes and each edge node is doing something more than trivial data acquisition. The issues can be broken down into categories based on when you have to deal with them.
- Day 0 issues are those that come up even before your edge nodes are installed.
- Day 1 issues are those related to provisioning edge nodes.
- Day 2 issues are those associated with scaling your systems or the ongoing management of existing nodes.
Day 0: In the factory
During the manufacture of edge nodes, you generally have only a limited ability to customize each node. You normally can specify a factory image containing a fairly standard (but minimal) operating system plus some additional data. Typically, the manufacturing process also installs an unforgeable system identifier in the hardware itself. You also have to recognize that edge units can take time to deliver and may sit on a shelf before installation. This means that the software built into an edge node is not only not the final software you want to run, but will also be somewhat out of date by the time the unit is installed.
Even so, you can prepopulate a cache of container images on the machine as part of the factory image. If any of these images happen to be what you need when the edge unit is installed and activated, having the image cached will save time and bandwidth.
Beyond the base operating system and this container cache, however, there isn’t much that you can put onto these units. You can, however, record information about what each unit is supposed to do in a central database that will guide the unit through initial installation. This can be done as soon as a unit is allocated for a particular function. Prepopulating a configuration database this way can pay huge dividends by simplifying the installation process. Ideally, the vendor of the edge unit hardware will provide all of this activation and management infrastructure.
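The sketch below illustrates what prepopulating such a configuration database might look like, using SQLite purely for illustration. The schema, hardware identifier, and software assignment are assumptions; the point is that a record keyed by the unit’s unforgeable identifier exists before the unit ever reaches its installation site.

```python
# Illustrative only: prepopulate the central configuration database as soon as a
# unit is allocated. The schema and the sqlite backing store are assumptions; when
# the unit later phones home with its hardware ID, installation becomes a lookup.
import sqlite3

db = sqlite3.connect("edge_config.db")
db.execute("""CREATE TABLE IF NOT EXISTS unit_config (
                hardware_id TEXT PRIMARY KEY,
                site        TEXT,
                role        TEXT,
                software    TEXT)""")
db.execute("INSERT OR REPLACE INTO unit_config VALUES (?, ?, ?, ?)",
           ("HW-8843-AC21", "refinery-east", "vibration-monitor", "vib-agent:1.4.2"))
db.commit()
```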
Day 1: Installation by troglodytes
It is dangerous to assume any technical skills on the part of the people installing edge units beyond the basic ability to plug in power and network. It isn’t that these people actually lack skills so much as that they are likely focused on other jobs and won’t have sufficient information or tools at hand to configure these systems.
Instead, what should happen is that when the units are powered on the first time, they reach out over the internet to the central management system, identify themselves, register their presence, and download directions about what software to run. The hardware root of trust is critical at this point to ensure that only valid hardware can be used. A recent trend that is making things much better in edge systems is the use of microversions of Kubernetes (K3s and KubeEdge are current favorites) at the edge to manage applications. Even if you only have a single processor in your edge location, Kubernetes can simplify your life on Day 2 and beyond because it allows clean management of your application life cycle.
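A first-boot sequence along those lines might look roughly like the following sketch. The registration endpoint, response fields, and the way the hardware identifier is read are assumptions for illustration; in practice the request would also be authenticated using the hardware root of trust, and the returned assignment would be handed to K3s or KubeEdge rather than printed.

```python
# Illustrative first-boot sequence: the unit identifies itself and asks the central
# management system what to run. Endpoint and response fields are hypothetical.
import requests


def read_hardware_id() -> str:
    # Stand-in for reading the unforgeable identifier installed at manufacture.
    with open("/sys/class/dmi/id/product_uuid") as f:
        return f.read().strip()


resp = requests.post(
    "https://mgmt.example.com/api/v1/register",     # hypothetical registration endpoint
    json={"hardware_id": read_hardware_id()},
    timeout=30,
)
resp.raise_for_status()
assignment = resp.json()                 # e.g., {"site": "...", "images": [...]}
for image in assignment.get("images", []):
    print("pull and start:", image)      # in practice, handed off to K3s/KubeEdge
```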
Day 2: Happily ever after
Once you have a micro Kubernetes running, and once you have a secure proxy connected to a central management system, you should be able to manage your applications directly via Kubernetes API calls invoked by the central management system, possibly affecting dozens, hundreds, or thousands of systems at a time. Ideally, these applications will be able to store their data and move it using a centrally managed data infrastructure in an analogous fashion.
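As an illustration of that Day 2 pattern, the sketch below uses the official Kubernetes Python client to apply the same Deployment to a list of edge clusters from the core. The context names, namespace, and container image are hypothetical; a real fleet would likely drive this from the central management system’s inventory rather than a hardcoded list.

```python
# Illustrative only: apply one Deployment to many edge clusters from the core.
# Context names, namespace, and image are hypothetical.
from kubernetes import client, config

DEPLOYMENT = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "telemetry-agent", "namespace": "edge-apps"},
    "spec": {
        "replicas": 1,
        "selector": {"matchLabels": {"app": "telemetry-agent"}},
        "template": {
            "metadata": {"labels": {"app": "telemetry-agent"}},
            "spec": {"containers": [{
                "name": "agent",
                "image": "registry.example.com/telemetry-agent:2.1",   # hypothetical image
            }]},
        },
    },
}

edge_contexts = ["edge-site-001", "edge-site-002"]      # hypothetical kubeconfig contexts
for ctx in edge_contexts:
    config.load_kube_config(context=ctx)                # one kubeconfig entry per edge cluster
    apps = client.AppsV1Api()
    apps.create_namespaced_deployment(namespace="edge-apps", body=DEPLOYMENT)
```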
As the real-world use cases in this chapter show, it is possible to handle edge systems that carry out AI and analytics at scale, even extreme ones, such as the development of autonomous cars. You can do this in a scale-efficient manner if you make use of effective architecture, edge-designed hardware, and the same data infrastructure that supports non-edge systems as well.