The movie industry frequently depicts a future in which we live alongside embedded Autonomous Mobile Agents (AMAs) that seamlessly perceive, make decisions, and behave like humans. Whether and when this “seamlessness” becomes reality depends largely on companies, across all industries, overcoming a key data challenge in employing artificial intelligence: how to obtain enough data, and make sense of it, to develop the models that will power these agents. Data acquisition is step one; making sense of that data means finding patterns in it, assigning a standard meaning to each pattern, and using it to derive insights and develop models. This is the core of labeling data. Across industries and companies, this is a common, fundamental challenge, and you can see the symptoms in the investments. For example, Intel with Intel Saffron, Nervana, Altera, Movidius, and Mobileye; Google with DeepMind Technologies, Moodstocks, Api.ai, and Halli Labs; Apple with Lattice Data, RealFace, and SensoMotoric Instruments, to name a few. In this post, I would like to demonstrate the complexity of the data labeling challenge through the frame of AMA development.
What is an Autonomous Mobile Agent?
An Autonomous Mobile Agent is a robot that can have multiple physical embodiments or instantiations, such as a drone, car, boat, etc. The embodiment is simply the physical form, which determines the range of functions that will be executed by the artificial intelligence models deployed on it. The models are developed, trained, and evaluated in a data center and deployed at the edge or a remote location, with a subset of the processing needed to execute the models and send additional information back to the data center. One example of an AMA could be a self-driving car that has embedded intelligence to transport you safely from one location to another.
What does it take to create an Autonomous Mobile Agent?
To create an AMA, you must train it to perceive, make decisions, and behave in ways consistent with the environment it will need to navigate. Acquiring the data and developing the ability to make decisions on that information for an AMA is similar to the knowledge acquisition that occurs in humans. As a baby, you had no language or means of communicating, but you had basic senses (visual, auditory, haptics or touch, etc.), which you used to acquire data and start making sense of your environment. You likely had caregivers who were actively trying to teach you what things are within your environment (“Here is an apple and it is red” or “When a ball is thrown toward you, catch it.”). This continues until you have enough information to start extrapolating specific learning instances to other domains. This is knowledge acquisition. Sometimes your extrapolations are correct (a red car is perceived as red) and other times they are not (a banana is perceived as an apple), but you are repeatedly corrected, which improves your understanding of your environment. This feedback helps you learn new things and understand the conditions under which your extrapolations apply correctly or incorrectly.
As you grow older, you start rejecting the feedback or corrections of caregivers because you have acquired new knowledge of how things work. Beliefs, or models, drive hypotheses, helping you determine which behaviors or actions to take. These models of behavior are then modified through feedback mechanisms: reinforcement, in which positive feedback increases a behavior, or punishment, in which negative feedback decreases it. Honing your internal models results in new learning that can be applied across a diversity of scenarios.
Now that we have reviewed knowledge acquisition in humans, we can show parallel steps for AMA learning. Let’s start with the assumption that an AMA is simply a dumb form factor that needs to be able to make sense of its environment. To do this, we need to add mechanisms to acquire data about the environment, for example through sensors like radar, LiDAR (light detection and ranging), cameras, and GPS. The acquired data is meaningless and useless unless the AMA has a way to identify items within it. In machine learning, we call this object detection, recognition, and localization. Object detection is equivalent to detecting the presence of something that is not yet identified, recognition is matching that item to an existing concept in memory, and object localization is knowing where that thing is relative to your current location. To develop this capability, you need a lot of sensor data and you need to know what is in the data.
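The three capabilities above can be made concrete with a minimal Python sketch. The class name, fields, and threshold below are illustrative assumptions, not from any specific perception framework: the bounding box and distance capture localization, the label captures recognition, and the record as a whole represents one detection.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One detected object from a single sensor frame (illustrative schema)."""
    bbox: tuple          # (x, y, width, height) in pixels: where it is in the image
    label: str           # recognition: the concept matched in memory, e.g. "stop_sign"
    confidence: float    # the model's belief that the label is correct
    distance_m: float    # localization relative to the agent, e.g. from LiDAR

def detections_above(detections, threshold):
    """Keep only detections confident enough for downstream decision-making."""
    return [d for d in detections if d.confidence >= threshold]

frame = [
    Detection(bbox=(120, 40, 32, 32), label="stop_sign", confidence=0.94, distance_m=18.5),
    Detection(bbox=(300, 60, 20, 50), label="pedestrian", confidence=0.41, distance_m=7.2),
]
actionable = detections_above(frame, threshold=0.5)
```

In practice the low-confidence pedestrian would not simply be discarded; it might trigger more data gathering or a cautious fallback behavior, but the filtering step illustrates how raw sensor data becomes structured, actionable information.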
To determine what is in the data, a time-intensive process of manually identifying and labeling items, frame by frame, in each type of sensor stream is required. For example, a labeler would watch a video feed, frame by frame, and draw boxes around the items that match the identity of the requested labels. When a labeler sees a stop sign in a frame, they place a bounding box around it. This labeled data, called “ground truth,” is used to develop models. If you have poor-quality ground truth, your models will not predict reality as it occurs, which means your AMA will learn inaccurate information and, in turn, behave in a manner inconsistent with the context for which it was designed. The specific types of artificial intelligence models you are developing will determine how much ground truth data you will need. Models are designed to accurately predict items or events in data streams, and this requires testing their predictive ability against known entities in the ground truth data. To draw the parallel to humans, your child points to the snack on their plate and says "banana," but it is actually an apple. In this example, the child correctly detected a snack but incorrectly identified it as a banana (the ground truth was apple). Over many instances of encountering an apple, they will learn to identify it as an apple. Let’s now look at how models are built and improved, and how a model can accurately identify an object.
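One common way to score a model's bounding boxes against a labeler's ground truth boxes is intersection-over-union (IoU), a standard computer-vision measure of box overlap. The coordinates and the 0.5 threshold below are illustrative assumptions, though thresholds around 0.5 are widely used.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes, from 0 (disjoint) to 1 (identical)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle; width/height clamp to 0 when the boxes are disjoint.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

# A labeler's ground-truth box for a stop sign, and a model's slightly offset prediction:
ground_truth = (100, 100, 150, 150)
prediction = (105, 105, 155, 155)
# A prediction is commonly counted as correct when IoU exceeds a chosen threshold.
match = iou(ground_truth, prediction) >= 0.5
```

This is the moment the "banana vs. apple" correction happens for a machine: a predicted box that overlaps the ground truth poorly, or carries the wrong label, is counted as an error and feeds back into training.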
As we mentioned above, the type of model that is being developed determines what objects and events (labels) need to be in the ground truth data. For example, to build an autonomous vehicle model that accurately detects stop signs, a machine learning engineer may need 500 instances of stop signs, 500 instances of other road signs, and 500 instances where there is no stop sign present. To get that amount of ground truth data, you may very well need to acquire 100 TB of raw data, because the events of interest are sparse and you must collect and label far more data than the events themselves.
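A back-of-envelope calculation shows why the raw data requirement balloons. Every number below is an illustrative assumption (event frequency, frame rate, per-frame size), not a figure from any real fleet:

```python
def raw_data_needed_tb(instances_needed, events_per_hour, frames_per_second, mb_per_frame):
    """Rough estimate of raw sensor data to collect to capture enough labeled events."""
    hours_of_driving = instances_needed / events_per_hour
    total_frames = hours_of_driving * 3600 * frames_per_second
    return total_frames * mb_per_frame / 1_000_000  # MB -> TB (decimal)

# Illustrative assumptions: stop signs appear ~20 times per driving hour,
# cameras record at 30 fps, and each multi-sensor frame is ~4 MB.
tb = raw_data_needed_tb(instances_needed=500, events_per_hour=20,
                        frames_per_second=30, mb_per_frame=4.0)
```

Under these toy assumptions, roughly 25 hours of driving and about 11 TB of raw data are needed for just one class of event; rarer events, richer sensors, and more classes push the total far higher.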
Once manual labeling of objects has been completed for all sensor types, these events are combined to improve accuracy in a capability called “sensor fusion.” For example, labeled camera data can be fused with LiDAR data to increase the accuracy of a given object detection model. This is similar to how color-blind children might use many senses (e.g., haptics, visual, and auditory) to process the slight variations, shapes, and verbal names for a red apple. After there is enough ground truth data for the model to learn (again, this is determined by model type), it can then begin to infer on new data, developing a capability called “automatic labeling.”
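In its simplest form, fusing two sensors' views of the same object can be a weighted combination of their confidences, a style sometimes called late fusion. Real systems use far more sophisticated probabilistic methods; the function and weights here are a hypothetical sketch of the idea only:

```python
def fuse_confidence(camera_conf, lidar_conf, camera_weight=0.6):
    """Late fusion sketch: weighted average of per-sensor confidences for one object."""
    return camera_weight * camera_conf + (1 - camera_weight) * lidar_conf

# The camera alone is uncertain about a partially occluded sign,
# but LiDAR strongly confirms an object of the right shape and position.
fused = fuse_confidence(camera_conf=0.55, lidar_conf=0.90)
```

The fused confidence lands above the camera's estimate alone, illustrating how a second sensor stream can rescue a detection that a single sensor would have discarded.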
Automatic labeling executes previously developed object detection models on new data and facilitates the development of other models by providing ground truth on data sooner than the manual labeling process described earlier. There are different types of models: cluster-based, neural networks, decision trees, etc. It is the combination of many models, within a hierarchy, that creates driver policy, or rules for behavior. Once we have models, their efficiency and accuracy need to be evaluated in a series of simulated environments where we can challenge the car’s behavior in randomized, newly encountered scenarios. A model’s strength is evaluated by its performance (e.g., efficiency at calculating outcomes) and accuracy (e.g., proportion of correct outcomes) within the context in which it was designed. For example, in some scenarios you may want fast performance while you are okay with 80% accuracy, but in another scenario, the accuracy needs to be 99% and you are therefore willing to accept slower performance. Only after rigorous model evaluation and improvement can a model be deployed in the AMA, where it will be able to perceive, think, and behave in a way similar to humans.
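The performance/accuracy tradeoff can be framed as a selection rule: for each driving scenario, pick the most accurate model that still fits the latency budget. The model names, numbers, and dictionary schema below are illustrative assumptions:

```python
def pick_model(models, min_accuracy, max_latency_ms):
    """Choose the most accurate candidate that meets both requirements, or None."""
    eligible = [m for m in models
                if m["accuracy"] >= min_accuracy and m["latency_ms"] <= max_latency_ms]
    return max(eligible, key=lambda m: m["accuracy"]) if eligible else None

candidates = [
    {"name": "fast_net",     "accuracy": 0.80, "latency_ms": 5},
    {"name": "accurate_net", "accuracy": 0.99, "latency_ms": 40},
]
# Emergency braking: speed matters most, and 80% accuracy is acceptable.
braking = pick_model(candidates, min_accuracy=0.75, max_latency_ms=10)
# Sign classification: accuracy must reach 99%, and a slower response is acceptable.
signs = pick_model(candidates, min_accuracy=0.99, max_latency_ms=100)
```

Each scenario selects a different model from the same pool, which is exactly why a driver policy is a hierarchy of models rather than one model evaluated by a single number.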
What is the future of seeing, thinking, and behaving AMAs?
The process of seeing, thinking, and behaving for an AMA is not a simple task, but it is a process that can be achieved with time. In the coming years, there will likely be additional ways for a car to learn once infrastructures for car-to-car and car-to-infrastructure communication are in place. What we have discussed so far has focused on a single domain: driving. That is, how to navigate a prescribed range of behaviors within a specific type of domain. Some questions for you to consider as you look to a future of AMAs: what other models (e.g., experiences, behaviors, environments, etc.) would need to be developed for an AMA to be seamlessly integrated into our lives? What are the technology constraints that currently exist in developing AMAs? How does our society change as we add more domains of models? Are there some domains in life where building models is impossible (e.g., emotions)?
This post is a collaboration between O'Reilly and Intel Saffron. See our statement of editorial independence.