Chapter 4. AI and Machine Learning: A Nontechnical Overview

Although it is not necessary to be an expert or practitioner of AI in order to develop an AI vision and strategy, having a high-level understanding of AI and related subject matter areas is critical to making highly informed decisions. Helping you to develop this understanding is the goal of this chapter.

This chapter defines and discusses AI-related concepts and techniques, including machine learning, deep learning, data science, and big data. We also discuss how both humans and machines learn and how that relates to the current and future state of AI. We finish the chapter by covering how data powers AI, along with the data characteristics and considerations necessary for AI success.

This chapter helps develop a level-appropriate context for understanding the next chapter on real-world opportunities and applications of AI. Let’s begin by discussing the field of data science.

What Is Data Science, and What Does a Data Scientist Do?

Let’s kick off the discussion by defining data science and the role and responsibilities of a data scientist, both of which describe the field and skills required to carry out AI and machine learning initiatives (note that more specialized roles are becoming more common, such as machine learning engineer). Even though data scientists often come from many different educational and work experience backgrounds, most should be strong (or, ideally, experts) in four fundamental areas that I call the four pillars of data science expertise. In no particular order, these are the areas in which data scientists should have expertise:

  • Business in general or in the relevant business domain

  • Mathematics (including statistics and probability)

  • Computer science (including software programming)

  • Written and verbal communication

There are other skills and expertise that are highly desirable, as well, but these are the primary four in my opinion. In reality, people are usually strong in one or two of these pillars, but not equally strong in all four. If you happen to meet a data scientist who is truly an expert in all, you’ve found a person often referred to as a unicorn. People with an appreciable degree of expertise and competency in all four pillars are very difficult to find, and there’s a significant shortage of talent.

As a result, many companies have begun to create specialized roles around specific pillars of data science, which, when combined, are the equivalent of having a data scientist. An example could be a team of three people, of which one has an MBA-type background, another is a statistician, and another is a machine learning or software engineer. The team could also include a data engineer, for example. This team could then work on multiple initiatives at once, with each person focusing on a specific aspect of an initiative at any given time.

Based on these pillars, a data scientist is a person who should be able to use existing data sources and create new ones as needed in order to extract meaningful information, generate deep actionable insights, support data-driven decision making, and build AI solutions. This is done with business domain expertise, effective communication and results interpretation, and utilization of any and all relevant statistical techniques, programming languages, software packages and libraries, and data infrastructure. This, in a nutshell, is what data science is all about.

Machine Learning Definition and Key Characteristics

Machine learning is often considered a subset of AI. We discuss machine learning first in order to develop a foundation for our discussion of AI and its limitations later in this chapter.

Remember our simple definition of AI as intelligence exhibited by machines. This basically describes the ability of machines to learn from information and apply this knowledge to do things, as well as continue learning from experience. In many AI applications, machine learning is a set of techniques used for the learning part of the AI application process. Specific techniques that we discuss later can be considered subsets of AI and machine learning, and commonly include neural networks and deep learning, as shown in Figure 4-1.

Figure 4-1. AI, machine learning, neural networks, and deep learning relationships

I really like this short and succinct definition of machine learning that I came across in a Google Design blog article: “Machine learning is the science of making predictions based on patterns and relationships that’ve been automatically discovered in data.”

A nontechnical definition of machine learning that I usually give is that machine learning is the process of automatically learning from data without requiring explicit programming, with the ability to expand the knowledge learned with experience. A key differentiator of machine learning relative to rules-based techniques is the lack of explicit programming, particularly around specific domains, industries, and business functions. In advanced techniques such as deep learning, domain expertise might not be required at all, whereas in other cases, domain expertise is provided in the form of the features (referred to outside of machine learning as variables, data fields, or data attributes) selected or engineered to train models. In either case, the part about not requiring explicit programming is absolutely critical, and is really the most important aspect of machine learning to understand. Let’s put this in the context of an example.

Suppose that you were a programmer before machine learning was a thing, and you were tasked with creating a predictive model capable of predicting whether a person applying for a certain type of loan would default on that loan and therefore should or should not be approved for it. You would have written a long software program, very specific to the financial industry, with inputs such as a person’s FICO score, credit history, and the type of loan being applied for. The code would contain lots of very explicit programming statements (e.g., conditionals, loops). The pseudocode (programming code written in plain English) might look something like this:

If the person's FICO score is above 800, then they will likely not default
  and should be approved
Else if the person's FICO score is between 700 and 800
    If the person has never defaulted on any loan, they will likely not
      default and should be approved
    Else they will likely default and should not be approved
Else if the person's FICO score is less than 700
    ...

This is an example of very explicit programming (a rules-based predictive model) that contains specific domain expertise around the loan industry that is expressed as code. This program is hard-coded to accomplish only one thing. It requires domain/industry expertise to determine the rules (aka scenarios). It is very rigid and not necessarily representative of all factors contributing to potential loan default. The program must also be manually updated for any changes to the inputs or the loan industry in general.

As you can see, this is not particularly efficient or optimal and also will not result in the best predictive model possible. Machine learning using the right data, on the other hand, is able to do this without any explicitly written code, particularly code expressing loan industry expertise. Giving a slightly oversimplified explanation here, machine learning is able to take a dataset as input without knowing anything about the data or domain involved, pass it through a machine learning algorithm that also knows nothing about the data or domain involved, and produce a predictive model that has expert knowledge of how the inputs map to the output in order to make the most accurate predictions possible. If you understand this, you pretty much understand the purpose of machine learning at a high level.
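To make this concrete, here is a minimal sketch of my own (using scikit-learn and a tiny invented dataset; the feature values, library, and model choice are illustrative assumptions, not something prescribed in this book) of training a predictive model with no hand-written loan rules:

from sklearn.tree import DecisionTreeClassifier

# Each row is one past applicant: [FICO score, number of past defaults]
X = [
    [820, 0], [760, 0], [710, 1], [650, 2],
    [805, 1], [690, 0], [600, 3], [740, 0],
]
# Label for each row: 1 = defaulted, 0 = repaid
y = [0, 0, 1, 1, 0, 0, 1, 0]

# The algorithm learns the mapping from features to outcomes on its own;
# no if/else rules about FICO thresholds are written by hand
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X, y)

# Score a new applicant with a 720 FICO score and one past default
print(model.predict([[720, 1]]))        # predicted class (default or not)
print(model.predict_proba([[720, 1]]))  # predicted class probabilities

The same code pointed at a completely different labeled dataset would learn a completely different model, which is exactly the point: the domain knowledge lives in the data, not in the program.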

It’s worth mentioning that while machine learning algorithms themselves are able to learn without requiring explicit programming, humans are still very much needed and involved in the entire process of ideation, building, and testing machine learning–based AI solutions.

Ways Machines Learn

Machines learn from data through a variety of different techniques, with the most predominant being supervised, unsupervised, semisupervised, reinforcement, and transfer learning. The data used to train and optimize machine learning models is usually categorized as either labeled or unlabeled, as shown in Figure 4-2.

Figure 4-2. Labeled versus unlabeled data

Labeled data has a target variable, or value, that is intended to be predicted for a given combination of feature values (aka variables, attributes, fields). In predictive modeling, a type of machine learning application, a model is trained on a labeled dataset in order to predict the target value for new combinations of feature values. The presence of target data in the dataset is why the data is referred to as labeled. Unlabeled data, on the other hand, has feature values, but no particular target data or labels. This makes unlabeled data particularly well suited for grouping (aka clustering and segmentation) and anomaly detection.
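As a small, made-up illustration (using pandas; the column names are hypothetical and simply continue the earlier loan example), the only difference between a labeled and an unlabeled version of the same records is the presence of a target column:

import pandas as pd

# Labeled data: feature columns plus a target ("defaulted") column
labeled = pd.DataFrame({
    "fico_score": [820, 710, 650],
    "past_defaults": [0, 1, 2],
    "defaulted": [0, 1, 1],  # the target/label
})

# Dropping the target leaves unlabeled data: features only, with nothing
# to predict, which suits grouping (clustering) or anomaly detection
unlabeled = labeled.drop(columns=["defaulted"])
print(labeled)
print(unlabeled)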

One thing worth noting is that, unfortunately, labeled data in enough quantity can be very difficult to come by and can cost a lot of money and time to produce. Labels can be added to data records automatically or might require being added manually by people (think of a data record, aka sample, as a row in a spreadsheet or table).

Supervised learning refers to machine learning using labeled data, and unsupervised learning with unlabeled data. Semisupervised learning uses both labeled and unlabeled data.

Let’s briefly discuss the different learning types at a high level. Supervised learning has many potential applications, such as prediction, personalization, recommender systems, and pattern recognition. It is further subdivided into two applications: regression and classification. Both techniques are used to make predictions. Regression is used primarily to predict a single numeric value (discrete or continuous), whereas classification is used to assign one or more classes or categories to a given set of input data (e.g., spam or not-spam for emails).

The most common applications of unsupervised learning are clustering and anomaly detection; more generally, unsupervised learning is largely focused on pattern recognition. Other applications include dimensionality reduction (reducing the number of data variables and, with it, model complexity) using techniques such as principal component analysis (PCA) and singular value decomposition (SVD).
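As a rough sketch of these ideas (scikit-learn with synthetic data; the data, the choice of three clusters, and the use of two components are illustrative assumptions), clustering groups unlabeled points and PCA reduces the number of features:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# 150 unlabeled points in 5 dimensions, loosely centered on three groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 5)) for c in (0, 3, 6)])

# Clustering: assign each point to one of three discovered groups
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Dimensionality reduction: compress 5 features down to 2
X_2d = PCA(n_components=2).fit_transform(X)

print(clusters[:10])  # cluster assignments; there is no "correct" answer to compare against
print(X_2d[:3])       # the first few rows in the reduced 2-feature space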

Although the underlying data is unlabeled, unsupervised techniques can be employed in useful predictive applications when labels, characterizations, or profiles are applied to discovered clusters (groupings) through another process outside of the unsupervised learning process itself. One of the challenges with unsupervised learning is that there is no particularly good way to determine how well an unsupervised learning–generated model performs. The output is what you make of it, and there is nothing correct or incorrect about it. This is because there is no label or target variable in the data and therefore nothing against which to compare model results. Despite this limitation, unsupervised learning is very powerful and has many real-world applications.

Semisupervised learning can be a very useful approach when unlabeled data is plentiful and labeled data is not. Other popular types of learning that we cover more in the next chapter include reinforcement learning, transfer learning, and recommender systems.

In machine learning tasks involving labeled and unlabeled data, the process takes data input and maps it to an output of some sort. Most machine learning model outputs are surprisingly simple: either a number (continuous or discrete, e.g., 3.1415), one or more categories (aka classes; e.g., “spam,” “hot dog”), or a probability (e.g., 35% likelihood). In more advanced AI cases, the output might be a structured prediction (i.e., a set of predicted values as opposed to a single value), a predicted sequence of characters and words (e.g., phrases, sentences), or an artificially generated summary of the most recent Chicago Cubs game (GO CUBS!).
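A brief sketch of those three common output types (scikit-learn on toy data of my own invention): a number from a regression model, a class from a classification model, and a probability:

from sklearn.linear_model import LinearRegression, LogisticRegression

X = [[1], [2], [3], [4], [5], [6]]          # a single input feature
y_number = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]   # continuous target for regression
y_class = [0, 0, 0, 1, 1, 1]                # categorical target for classification

reg = LinearRegression().fit(X, y_number)
clf = LogisticRegression().fit(X, y_class)

print(reg.predict([[7]]))         # output is a number
print(clf.predict([[7]]))         # output is a category (class)
print(clf.predict_proba([[7]]))   # output is a probability per class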

AI Definition and Concepts

Earlier we gave a simple definition of AI as intelligence exhibited by machines, which includes machine learning and specific techniques such as deep learning as subsets. Before further developing a definition of AI, let’s define the concept of intelligence in general. A rough definition for intelligence is:

Learning, understanding, and the application of the knowledge learned to achieve one or more goals

So, basically intelligence is the process of using knowledge learned to achieve goals and carry out tasks (for humans, examples include making a decision, having a conversation, and performing work tasks). Having now defined intelligence in general, it’s easy to see that AI is simply intelligence as exhibited by machines. More specifically, AI describes when a machine is able to learn from information (data), generate some degree of understanding, and then use the knowledge learned to do something.

The field of AI is related to and draws from aspects of neuroscience, psychology, philosophy, mathematics, statistics, computer science, computer programming, and more. AI is also sometimes referred to as machine intelligence or cognitive computing given its foundational basis and relationship to cognition; that is, the mental processes associated with developing knowledge and comprehension.

More specifically, cognition and the broader field of cognitive science are terms used to describe the brain’s processes, functions, and other mechanisms that make it possible to gather, process, store, and use information to generate intelligence and drive behavior. Cognitive processes include attention, perception, memory, reasoning, comprehension, thinking, language, remembering, and more. Other related and somewhat deeper and philosophical concepts include mind, sentience, awareness, and consciousness.

So what powers intelligence? For AI applications, the answer is information in the form of data. In the case of humans and animals, new information is constantly collected from experience and the surrounding environment through the five senses. This information is then passed through the cognitive processes and functions of the brain.

Amazingly, humans can also learn from existing information and knowledge already stored in the brain by applying it to understand and develop knowledge about something else as well as to develop one’s thoughts and opinions about a new topic, for example. How many times have you thought through something given information you already understood and then had an “aha!” moment that resulted in a new understanding of something else?

Experience factors heavily into AI, as well. AI is made possible by a training and optimization process that uses relevant data for a given task. AI applications can be updated and improved over time as new data becomes available, and this is the learning-from-experience aspect of AI.

Continually learning from new data is important for many reasons. First, the world and its human inhabitants are always changing. Trends and fads come and go; new technologies are introduced, and old technologies become obsolete; industries are disrupted; and new innovations are continuously introduced. As a result, data related to online shopping today, for example, might be very different from the data you receive tomorrow, or years from now. Automotive manufacturers might begin asking what factors contribute most to people purchasing flying vehicles, as opposed to the electric vehicles that are gaining popularity and wider use today.

Ultimately, data and the models trained from it can become stale, a phenomenon referred to as model drift. It is therefore critical that any applications of AI are refreshed and continue to gain experience and knowledge through continued learning from new data.

AI Types

Often AI is referred to with a qualifier like strong or narrow. These qualifiers, which we cover next, are meant to describe the nature of the AI being discussed. The aspect of AI being described by the qualifier can be related to the number of simultaneous tasks that an AI can carry out; the architecture of a given algorithm, in the case of neural networks; the actual use of the AI; or the relative difficulty of solving a given problem using AI.

Although this might differ depending on the reference or researcher, AI can be grouped into categories and relationships, as shown in Figure 4-3.

Figure 4-3. AI categories and relationships

Starting with artificial narrow intelligence (ANI), the terms “weak” and “narrow” are both used interchangeably to indicate that an AI is specialized and able to carry out only a single, narrow task; it is not able to demonstrate cognition. This means that weak AI, although often considerably impressive, is not sentient, aware, or conscious in any way. As of this writing, almost all AI is considered weak AI.

“Shallow” and “deep” are qualifiers used to describe the number of hidden layers in a neural network architecture (discussed in detail in Appendix A). Shallow AI usually refers to a neural network with a single hidden layer, whereas deep AI (synonymous with deep learning) refers to a neural network with more than one hidden layer.

Applied AI is as it sounds. It is the application of AI to real-world problems such as prediction, recommendation, natural language, and recognition. These days you often hear the term smart used to describe AI-powered software and hardware solutions (e.g., smart homes). That is, some form of AI is being used as part of the solution, although companies often exaggerate their use of AI. Given that all AI today is considered narrow, applied AI is shown as being related to narrow AI. That might change in the future, which brings us to the next category.

Artificial general intelligence (AGI) is also called “strong” or “full” AI. AGI sets the bar in that it represents machine intelligence able to demonstrate cognition and carry out cognitive processes to the same degree as a human. Or, in other words, it has cognitive abilities that are functionally equivalent to a human. This means that a machine can perform any task a human can and is not limited to applying intelligence to a single specific problem. This is an extremely high bar to set. We discuss AGI and the challenges of achieving it in more detail later in this chapter.

It’s worth noting that certain AI problems are referred to as AI-complete or AI-hard (e.g., AGI, natural language understanding), which just means that these problems are very advanced and difficult to solve completely and in general. Creating machines that are equivalently intelligent to humans is a very difficult problem to solve, and is not the AI that we have today.

Artificial super intelligence (ASI), and related concepts such as technological singularity and superintelligence, describe the scenario in which AI becomes self-improving in a runaway fashion and ultimately surpasses human intelligence and technological advancement. Even though the possibility of a singularity and superintelligence is largely debated, it is highly unlikely anytime in the near future, if at all. Also, although it’s not anything to worry about right now, it’s worth noting that certain techniques such as deep reinforcement learning are being used in AI applications for self-directed learning that improves over time.

Learning Like Humans

Consider that babies and very young children are able to recognize an object such as a particular animal in almost any context (e.g., location, position, pose, lighting) despite having seen a picture or illustration of that animal only once, for example. This is a remarkable feat of the human brain, which involves initial learning followed by the application of that learning to different contexts.

There is a great article that was published in the MIT Technology Review called “The Missing Link of AI,” written by Tom Simonite. The article is about the way that humans learn and how AI techniques must evolve in order to be able to learn in a similar way and ultimately exhibit human-like intelligence. Jeff Dean from Google is quoted there as saying, “Ultimately, unsupervised learning is going to be a really important component in building really intelligent systems—if you look at how humans learn, it’s almost entirely unsupervised.” Yann LeCun expands on this: “We all know that unsupervised learning is the ultimate answer.”

The article points out that infants learn by themselves that objects are supported by other objects (e.g., book on a coffee table) and therefore supported objects do not fall to the ground despite the force of gravity. Children also learn that inanimate objects will remain in a room in the same place after they leave the room, and they can expect them to still be there when they return. They do this without being explicitly taught, or in other words, this learning is unsupervised and does not involve labeled data, where the label could be a parent teaching a child something.

Children also learn by trying different things over time, such as through experimentation and trial and error. They do this even when they’re not supposed to, or even when knowing that some outcomes might be negative, but they do so in order to learn about cause and effect and to learn about the world around them in general. This method of learning is highly analogous to reinforcement learning, which is a very active area of research and development in AI and can potentially help make significant progress toward human-like intelligence.

In the context of human learning in general, humans are able to sense the world around them and put meaning to things on their own. These things can be recognizing patterns, objects, people, and places. It can be figuring out how something works. Humans also know how to use natural language to communicate. Much of how humans learn happens in an unsupervised, self-learning, and trial-and-error way. These are all remarkable feats of the human brain; feats that are very difficult to emulate with algorithms and machines, as we discuss next.

AGI, Killer Robots, and the One-Trick Pony

AGI—producing an AI that is able to do and understand anything a human can at least as well—is still a long way off. This means that we don’t need to worry about killer robots for now, or maybe ever. Pedro Domingos, in his book The Master Algorithm1 states, “People worry that computers will get too smart and take over the world, but the real problem is that they’re too stupid and they’ve already taken over the world.” He goes on to say that the chances of AI taking over the world are zero given the way in which machines learn and, most important, because computers don’t have a will of their own.

AGI is a very difficult problem to solve. Think about it—to truly replicate all human intelligence in a machine, the AI would need to be able to observe the world around it, self-learn on an ongoing and self-directed basis (i.e., demonstrate true autonomy) in order to continuously make sense of everything, and potentially self-improve like humans. It would need to understand everything humans do, possibly more, and be able to generalize and transfer knowledge to any context. This is largely what children and adults do. But how do you do that when unsupervised learning has no correct answers as we discussed earlier? How do you train a machine to learn something without teaching it—that is, make it self-learning?

These are great questions, and the answer is that you don’t, at least not with today’s state-of-the-art AI and machine learning methods. Currently, the most advanced techniques in AI include neural networks, deep learning, transfer learning, and reinforcement learning. These techniques are not particularly suited for unsupervised learning applications. They are also very focused on a single, highly specialized task.

A deep learning neural network that is trained to recognize cats in an image is not able to also predict your home price three years from now; it’s capable of nothing more than recognizing cats in an image. If you want a predictive model to predict your home price, you must create and train a separate model. AI is therefore not good at multitasking, and each instance is pretty much a one-trick pony for now.

Although the human brain remains a mystery to neuroscientists as to how it works exactly, one thing appears to be clear: The brain is not a pure calculating machine in the same sense as a computer. The brain is thought to process sensory information with a complex, algorithmically based biological neural network mechanism that can store memories based on patterns, solve problems, and drive motor actions (behavior) based on information recall and prediction.

This is a process that begins at birth and continues throughout our lifetimes, and often in a highly unsupervised, trial-and-error-based way as mentioned. The brain’s incredible memory storage and recall mechanism is what differentiates it from a pure calculating machine, and is what makes human unsupervised learning possible. A single human brain is able to continuously learn and store all information learned and memories developed in an entire human lifetime. How would a machine emulate that? It wouldn’t—at least not anytime soon.

Unlike the unsupervised and self-directed learning of humans, computing machines are completely dependent on extremely detailed instructions. This is what we call software code. Even automatically learned, not explicitly programmed predictive models (the magic of AI and machine learning) are incorporated into software-based programs written by computer programmers. Given the current state of the art in AI, AGI is impossible unless a machine is trained on or programmed for every possible sensory scenario that it will encounter in any environment and under any condition.

One implication of this is that intelligent machines cannot have free will like humans; that is, the ability to make any decision or take any action within reason in the way that humans do, given a limited or even unlimited set of possibilities. With the exception of techniques like reinforcement learning, intelligent machines are constrained to only mapping inputs to certain outputs.

Human brains, on the other hand, can react to scenarios that they have or have not previously encountered. They can integrate sensory information from the five senses naturally, in real time, with seamless ease and great relative speed. Humans are able to continuously adapt to unplanned changes in their environment, such as having unexpected conversations with people (e.g., a phone call, running into a friend), figuring out why the TV suddenly won’t turn on, dealing with sudden changes in weather, reacting to accidents (e.g., car, spills, broken glass), missing a bus, determining that an elevator is out of service (humans immediately know to locate the stairway instead), a credit card not working, a grocery bag breaking, or avoiding a child that suddenly runs across their path. The number of real-world examples is almost infinite.

Humans are also able to think, a process that does not require sensory input data. You could be sitting on a beach staring out at the ocean waves while thinking about many things totally unrelated to the beach and ocean, but today’s AI algorithms are like meat grinders: you need to put beef into the grinder in order to get ground beef. Aside from techniques such as reinforcement learning, AI algorithms do not produce an output without related inputs, especially not anything close to human thoughts.

In The Book of Why2 the authors discuss that humans are also able to reason, make decisions, take actions, and come to conclusions based on having a causal understanding (cause and effect) of the world and the ability to reflect. Reflection means that we’re able to retrospectively look at our decisions or actions, analyze the results, and decide whether we would have done things differently, or will do something different in a similar situation next time. This is a form of natural human learning where the inputs are previous actions taken or decisions made.

We also have a causal understanding of the world that we continue to develop throughout our lifetimes. We know that correlation does not imply causation, and yet most AI and machine learning algorithms are based on correlations (e.g., predictive analytics) and have absolutely no concept of causation. A well-known example is the fact that increases in ice cream sales are accompanied by increased drowning deaths; a predictive algorithm might therefore learn that increased ice cream consumption causes drowning. With a little thought, humans can easily figure out that the missing factors, called confounding variables, are time of year and temperature, which are the true causes of the increases in both. AI would be unable to figure this out.

Lastly, there is a significant difference between automation and autonomy, both of which are highly relevant in the context of the progression of robotics and AI toward AGI. Automation is the result of writing software programs that automatically perform a task on a one-off or repeating basis that previously required some human assistance. Autonomy, on the other hand, is all about independence, self-direction, and the ability to respond to interactions and changes in the environment. There are varying degrees of both automation and autonomy in existing robotics and AI applications, with the majority of applications being more on the automation side at this time.

True autonomy is very difficult for reasons already mentioned in the context of AGI, but also because of limitations with sensing techniques such as computer vision. Computer vision and image recognition have come a long way in terms of object detection and identification in controlled and consistent circumstances, but the technology is not yet good at understanding the ever-changing, inconsistent, and surprise-filled environments that better reflect reality.

The Data Powering AI

There is one thing that AI, machine learning, big data, IoT, and any other form of analytics-driven solutions have in common: data. In fact, data powers every aspect of digital technology.

This section covers the power of data, including using data to make decisions, common data structures and formats used in AI applications, data storage and common data sources, and the concept of data readiness.

Big Data

The world has never collected or stored as much data as it does today. In addition, the variety, volume, and generation rate of data are growing at an astonishing pace. Rio Tinto, for example, a leading mining company that generates more than $40 billion in revenue, has embraced big data and AI to make data-driven decisions from 2.4 terabytes of sensor data generated per minute!

The field of big data is all about efficiently acquiring, integrating, preparing, and analyzing information from these enormous, diverse, and fast-moving datasets. However, handling and extracting value from these datasets might not be feasible or achievable due to hardware and/or computational constraints. To deal with these challenges, new and innovative hardware, software tools, and analytics techniques are required. Big data is the term that is used to describe this combination of datasets, techniques, and customized tools.

Also, data of any kind is basically useless without some form of accompanying analytics (unless the data is being monetized). In addition to the description given, the term big data is also used to describe performing analytics on very large datasets, which can include advanced analytics techniques such as AI and machine learning.

Data Structure and Format for AI Applications

At a high level, we can classify data as structured, unstructured, or semistructured as illustrated in Figure 4-4.

Figure 4-4. Data types

Let’s begin with structured data. We can think of structured data as, well, data having structure. Although shown in Figure 4-4 in tabular form, structured data is data that is generally organized and can easily fit into a table, spreadsheet, or relational database, for example. Structured data is typically characterized as having features, also called attributes or fields. The scheme describing this organization is commonly referred to as a data model, and with it in place, it becomes relatively easy to query, join, aggregate, filter, and sort the data.

Figure 4-4 shows an example of structured data in table format. In this case, the data is organized into columns and rows, where rows represent individual data examples (aka records, samples, or data points). The columns represent the data features for each example. Figure 4-4 also shows examples of labeled and unlabeled data, concepts we discussed earlier.

Unstructured data is the opposite of structured data and is therefore not organized or structured in any way, nor characterized by a data model. Common examples include images, videos, audio, and text such as that found in comments, the body of emails, and speech translated to text.

Note that unstructured data can be labeled, as is often the case with images. Images can be labeled based on the primary subject of the image; for example, images labeled as either cat or dog depending on which animal type is depicted.

Semistructured data has some structure, but is not easily organized into tables like structured data. Examples include XML and JSON formats, both of which are often used in software applications for data transfer, payloads, and representation in flat files.
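For instance, a small JSON payload (field names invented for illustration) has structure in the form of keys, nesting, and repeated elements, but it does not drop neatly into a single flat table:

import json

payload = """
{
  "customer": {"id": 42, "name": "Ada"},
  "orders": [
    {"sku": "A100", "qty": 2},
    {"sku": "B205", "qty": 1}
  ]
}
"""

record = json.loads(payload)
print(record["customer"]["name"])  # nested field: "Ada"
print(len(record["orders"]))       # a variable-length list of nested records: 2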

The final type and format of data that is relevant to AI applications is sequence data, with language and time series as two common examples. Sequence data is characterized by data ordered in a sequence for which the ordering mechanism is some sort of index. Time is the index in time-series data, and sensors in an IoT or data acquisition system are a great example of a source of time-series data.

Another example of sequence data is language. Language is characterized not only by grammar and use in communication, but also by the sequences of letters and words. A sentence is a sequence of words, and when the word sequence is rearranged it can easily take on a different meaning or, in the worst case, make no sense whatsoever. Words are arranged in a way that has a very specific meaning and makes the most sense to those who speak a given language.

Data Storage and Sourcing

Companies and people in general generate a ton of data, often through disparate and nonunified software and hardware applications, each built on a unique “backend,” or database. Databases are used for both permanent and temporary data storage. Databases come in many different types, which differ in the way they physically store data on disk, the type of data they store (e.g., structured, unstructured, and semistructured), the data models and schemas they support, the query language they use, and the way they handle governance and management tasks such as scalability and security. In this section, we focus on some of the most commonly used databases for AI applications: relational databases and NoSQL databases.

Relational database management systems (RDBMS) are very well suited for storing and querying structured relational data, although some support storing unstructured data and multiple storage types, as well. Relational data means that data stored in different parts (i.e., tables) of the database are often related to one another according to predefined types of relationships (e.g., one to many). Each table (or relation) consists of rows (records) and columns (fields or attributes), with a unique identifier (key) per row. Relational databases typically offer data integrity and transactional guarantees that other databases do not.
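As a minimal sketch of what “relational” means in practice (using Python’s built-in sqlite3 module; the tables, columns, and values are invented), two tables are related through a key and queried with a join:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL,
                         FOREIGN KEY (customer_id) REFERENCES customers(id));
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# One customer relates to many orders (a one-to-many relationship)
rows = conn.execute("""
    SELECT c.name, COUNT(o.id), SUM(o.total)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
""").fetchall()
print(rows)  # e.g., [('Ada', 2, 65.0), ('Grace', 1, 15.0)]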

NoSQL database systems were created for, and have gained widespread popularity primarily due to benefits relating to, scalability and high availability. These systems are also characterized as being modern web-scale databases that are typically schema-free, provide easy replication, and have simple application programming interfaces (APIs). They are best suited for unstructured data and applications involving huge volumes of data—for example, big data. In fact, many of these systems are designed for extraordinary request and data volumes that can take advantage of massive horizontal scaling (e.g., thousands of servers) in order to meet demand.

There are multiple types of NoSQL databases, with document, key–value, graph, and wide-column being the most prevalent. The different types refer mainly to how the data is stored and the characteristics of the database system itself. It’s worth noting another type of database system that’s been getting some attention in recent years. NewSQL database systems are relational database systems that combine RDBMS-like guarantees with NoSQL-like scalability and performance.

Specific Data Sources

There are a lot of specific types of data sources, and many are used simultaneously at any given large company. Certain types of data can be used to automate and optimize customer-facing products and services, whereas others are better suited for optimizing internal applications. Here is a list of potential data sources, which we will look at individually:

  • Customers

  • Sales and marketing

  • Operational

  • Event and transactional

  • IoT

  • Unstructured

  • Third party

  • Public

Most companies use a customer relationship management tool, or CRM. These tools manage interactions and relationships with existing and potential customers, suppliers, and service providers. Additionally, many CRM tools are able to manage multichannel customer marketing, communications, targeting, and personalization, either natively or through integrations. As a result, CRM tools can be a very significant source of data for customer-centric AI applications.

Although many companies use CRM tools as their primary customer database, customer data platform (CDP) tools such as AgilOne are used to create a single, unified customer database by combining data sources around customer behavior, engagement, and sales. CDP tools are intended to be used by nontechnical people, and are similar to data warehouses in that they’re used to drive efficient analytics, insights gathering, and targeted marketing.

Sales data is some of the most, if not the most, important data that a company has. Typical data sources include point-of-sale data for companies with brick-and-mortar locations, ecommerce data for online shopping applications, and accounts receivable for sales of services. Many companies that sell products at physical locations also sell products online and therefore are able to use both sources of data.

Marketing departments communicate and provide offers to customers through multiple channels and generate channel-specific data accordingly. Common marketing data sources can include email, social, paid search, programmatic advertising, digital media engagement (e.g., blogs, whitepapers, webinars, infographics), and push notifications for mobile apps.

Operational data is centered around business functions and processes. Examples include data associated with customer service, supply chain, inventory, ordering, IT (e.g., network, logs, servers), manufacturing, logistics, and accounting. Operational data is often best harnessed to gain deep insights into internal company operations in order to improve and potentially automate processes to achieve goals such as increasing operational efficiency and reducing costs.

For companies built primarily around digital products such as Software as a Service (SaaS) applications and mobile apps, there is usually a lot of event- and transaction-based data generated and collected. It’s worth noting that even though individual sales can certainly be considered transactional, not all transactional data is associated with sales. Event and transaction data can include bank transfers, submitting an application, abandoning an online shopping cart, and user interaction and engagement data such as clickstream data and data collected by applications like Google Analytics.

With the IoT revolution in full swing, research indicates that it will generate up to $11 trillion in economic value through more than 75 billion connected devices worldwide by 2025. Needless to say, a huge and increasing amount of data is generated by connected devices and sensors. This data can be very useful for AI applications.

Companies also have a lot of highly valuable unstructured data that often goes largely unused. Unstructured data as previously discussed can include images, videos, audio, and text. Text data can be particularly useful for natural language processing applications when stemming from product or service customer reviews, feedback, and survey results.

Lastly, companies usually employ multiple third-party software tools that might not have been mentioned in this section. Many software tools allow data to be integrated with other tools and also exported for analysis and portability. Third-party data can be purchased in many cases, as well. Finally, with the internet explosion and open source movement, there is also a tremendous amount of freely available and highly useful public data that we can use.

The keys to using data to help generate deep actionable insights and power AI solutions are data availability and access, whether to centralize the data or not, and all of the data readiness and quality considerations, which I cover in the next section.

Data Readiness and Quality (the “Right” Data)

Let’s close this chapter with a critical concept that is a major consideration in AIPB: data readiness and quality. High-quality, ready data (as we’ll define it) that can successfully power a certain AI solution is what I call the “right” data. This is paramount to solution success.

I use the term data readiness to collectively refer to the following:

  • Adequate data amount

  • Adequate data depth

  • Well-balanced data

  • Highly representative and unbiased data

  • Complete data

  • Clean data

Let’s discuss the concept of feature space before going over each of these data readiness points in turn. The term “feature space” refers to the set of all possible feature value combinations across the features included in a dataset being used for a specific problem. In many cases, adding more features results in an exponential increase in the amount of data required for a given problem, due to a phenomenon known as the curse of dimensionality, which we discuss further in the sidebar that follows in a few moments.
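A quick back-of-the-envelope illustration (plain Python arithmetic): with binary features, each added feature doubles the number of possible value combinations, which is why the data needed to cover the feature space can grow exponentially:

# Each binary feature doubles the size of the feature space
for n_features in (5, 10, 20, 30):
    combinations = 2 ** n_features
    print(f"{n_features} binary features -> {combinations:,} possible combinations")

# 5  -> 32
# 10 -> 1,024
# 20 -> 1,048,576
# 30 -> 1,073,741,824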

Adequate Data Amount

Let’s begin with the need for an adequate amount of data. Enough data is required to ensure that the relationships discovered during the learning process are representative and statistically significant. Also, the more data you have, the more accurate the model is likely to be. More data also allows for simpler models and a reduced need to create new features from existing ones, which is a process known as feature engineering. Feature engineering can be as simple as converting units; other times, it involves creating entirely new metrics from combinations of other features.

Adequate Data Depth

It’s not enough to have adequate amounts of data in general: AI applications also require enough varied data. This is where adequate data depth comes into play. Depth means that there’s enough varied data that adequately fills out the feature space—a good enough set of combinations of different feature values that a model is able to properly learn the underlying relationships between data features themselves as well as between data features and the target variable when present in labeled data.

In addition, imagine having a data table consisting of thousands of rows of data. Suppose that the vast majority of the rows consist of the exact same feature values repeated. In this case, having a lot of data doesn’t really do us any good, because the model is able to learn only whatever relationships are represented between the repeated feature data and the target. One thing to note is that it is highly unlikely that any given dataset will have every combination of all feature values and therefore completely fill the given feature space. That’s okay and is usually expected. You often can get adequate results with enough variation in the data.

Well-Balanced Data

A related concept is that of having balanced data, which applies to labeled datasets. How balanced a dataset is refers to the proportion of target values in the dataset. Suppose that you have a spam versus not-spam dataset with which you want to train an email spam classifier. If 98% of the data is not-spam emails and only 2% is spam emails, the classifier might not have nearly enough spam examples to learn what real-world spam emails might contain in order to effectively classify all new and not-yet-seen future emails as either spam or not-spam. Having equal proportions of target values is ideal, but that can be difficult to achieve. Often, certain values or classes are simply more rare and therefore unequally represented. There are data preparation and modeling techniques that you can use to try to compensate for this, but a detailed treatment is out of the scope of this discussion (one simple approach is sketched below).
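One simple (and admittedly naive) illustration of checking and rebalancing class proportions, using pandas and scikit-learn on made-up data; real projects typically use more careful techniques:

import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    "length": [120, 80, 95, 300, 110, 130, 90, 100, 85, 95],
    "label":  ["not-spam"] * 9 + ["spam"],
})
print(df["label"].value_counts())  # 9 not-spam vs 1 spam: highly imbalanced

# Naively upsample the rare class by sampling it with replacement
spam = df[df["label"] == "spam"]
not_spam = df[df["label"] == "not-spam"]
spam_upsampled = resample(spam, replace=True, n_samples=len(not_spam), random_state=0)

balanced = pd.concat([not_spam, spam_upsampled])
print(balanced["label"].value_counts())  # now 9 and 9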

Highly Representative and Unbiased Data

Another related concept is having representative data. This is similar to having enough data depth to adequately fill the feature space. Having representative data means not only filling the feature space as much as possible, but also representing the range and variety of feature values that a given model will likely see in the real world under all circumstances, present and future. From this perspective, it’s important to make sure that not only does the data have enough variety and combinations of feature values, but also that it covers the real-world ranges and combinations likely to be seen after it’s put in production.

If you’re working with data that is a sample or selection from a much larger dataset, it is important to avoid sample selection bias, or simply sampling bias (a type of selection bias). Avoiding skewed or biased data samples results in highly representative data, as discussed. Randomization is an effective technique to help mitigate sampling bias. Another and much more serious form of bias that should be avoided is known as algorithmic bias, a topic we cover further in Chapter 13.

Complete Data

Data completeness means having all of the relevant data available: the leading factors, contributors, and indicators, or, put another way, the data that has the biggest relationship to and influence on the target variable in supervised learning applications. It can be very difficult to create a model to predict something when the available data doesn’t include the factors that contribute the most to that something’s value.

Sometimes, simply adding additional data features can do the trick, whereas other times, new features must be created from existing features and raw data; in other words, the feature engineering process, as previously mentioned. Part of ensuring that your data is complete also includes making sure to deal with any missing values. There are many ways to deal with missing values, such as imputation and interpolation, but further discussion is out of scope here.
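As one brief, hedged example of dealing with missing values (scikit-learn’s SimpleImputer on made-up numbers; whether mean imputation is appropriate depends entirely on the data):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([
    [700.0, 2.0],
    [np.nan, 1.0],    # missing first feature
    [650.0, np.nan],  # missing second feature
    [800.0, 0.0],
])

imputer = SimpleImputer(strategy="mean")  # replace each NaN with its column mean
X_filled = imputer.fit_transform(X)
print(X_filled)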

Clean Data

Finally, data cleanliness is a critical part of data readiness. Combined with feature engineering and feature selection, data cleaning and preparation are two of the most critical tasks in AI and machine learning development. Data cleaning and preparation—often also referred to as data munging, wrangling, processing, or transformation—are usually handled as part of the actual data science and modeling process, which I cover in depth in Appendix B. Data is rarely clean and well suited for machine learning and AI tasks. It usually requires a lot of work to clean and process, and practitioners often say that 80% of AI and machine learning work is cleaning data, and the other 20% is the cool stuff; for example, predictive analytics and natural language processing (NLP). This is a classic example of the Pareto principle at work.

We can consider data “dirty” for many different reasons. Often data contains outright errors. For example, a mistake might have been made when preparing the dataset, and the header doesn’t match the actual data values. Another example would be a data feature labeled “Email” whose values all consist of phone numbers. Sometimes values are incomplete, corrupted, or incorrectly formatted. An example could be phone numbers that are all missing a digit for some reason. Maybe you have text strings in the data that should be numbers. Datasets often contain strange values such as NA or NaN (not a number), as well. The degree to which data is reliable and error free is a measure of data veracity, and such data is highly sought after.
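A rough sketch of a few common cleanup steps with pandas (the example data is invented to mirror the problems described above): renaming a mislabeled header, coercing text that should be numeric, and standardizing stray NA markers:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Email": ["555-1234", "555-9876", "NA"],  # mislabeled column: these are phone numbers
    "age":   ["34", "not available", "29"],   # numbers stored as text, with a bad value
})

df = df.rename(columns={"Email": "phone"})             # fix the misleading header
df["age"] = pd.to_numeric(df["age"], errors="coerce")  # bad strings become NaN
df = df.replace({"NA": np.nan})                        # standardize NA markers

print(df)
print(df.isna().sum())  # count remaining missing values per column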

A Note on Cause and Effect

One very important concept worth mentioning is the difference between cause and effect and how this relates to AI, machine learning, and data science. Even though measuring effects as captured in data can be relatively easy, finding the underlying causes that result in the observed effects is usually much more difficult.

In predictive analytics, there are ways to use the parameters of certain model types (which I cover in Appendix A) as an estimate of the effect a certain feature or factor has on a particular outcome, and thus the relative, quantitative impact a predictor has on the target variable that we are trying to predict. Likewise, we can use statistical techniques to measure correlations between features; that is, how strongly tied to one another they are. Both of these techniques provide useful information, but that information can be misleading.

As a completely contrived example to illustrate the point, perhaps we determine that increased marshmallow sales are directly related to rising home prices, and the correlation between the two seems very strong. We could conclude that marshmallow sales cause the effect of increased home prices, but we are smart and know that this is highly unlikely and that there must be something else going on. Usually there are other factors at play that we don’t measure or know about (i.e., the aforementioned confounding variables).

In this example, perhaps s’mores have become a super-trendy dessert at restaurants in an area that is experiencing a huge increase in real estate demand and growth due to the influx of large corporations. The true underlying cause of increased home prices here is the influx of corporations, and the increased marshmallow sales is simply a trend in the area, but both are happening at the same time.

Understanding the true underlying causes of a particular effect is ideal because it allows us to gain the deepest understanding and insight, and also make the most appropriate and optimal changes (i.e., pull the right levers by the right amount), to achieve a certain outcome. Various methods of testing and experimentation (e.g., A/B and multivariate) have been devised to determine causal relationships, but these techniques can be difficult or impossible to perform in practice for certain scenarios (e.g., trying to determine the causes of lung cancer). As a result, other techniques have been devised such as observational causal inference, which attempts to gain the same insights from observed data.

Summary

Hopefully, this chapter helped you to better understand the definitions, types, and differences between AI and its related fields. We discussed how humans and machines learn, and that AI and machine learning represent the techniques used by machines to learn from data without requiring explicit programming, and then use the knowledge gained to carry out certain tasks. This is what makes machines exhibit intelligence; it is the secret sauce. It allows humans to use analytics in ways they would otherwise not be able to on their own.

Data science, on the other hand, represents what I call the four pillars of data science expertise (business/domain, math/stats, programming, and effective communication), combined with scientific processes, in order to cultivate adequate data, iteratively generate deep actionable insights, and develop AI solutions.

We also discussed how data powers AI solutions and the important data characteristics and considerations necessary for AI success. Most important, this includes the concepts of data readiness and quality. Both of these are required for AI success.

With the knowledge gained from this chapter, let’s next discuss real-world opportunities and applications of AI. This should help spark ideas and provide the needed context for developing an AI vision, which is the subject of Part II of this book.

1 Domingos, Pedro. The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World. New York: Basic Books, 2015.

2 Pearl, Judea and Dana Mackenzie. The Book of Why: The New Science of Cause and Effect. New York: Basic Books, 2018.
