Chapter 1. Data Science and Deep Learning
Only 20 years have passed since the beginning of the new millennium, and we have thrust the bulk of our technological knowledge into the machine age, an age that, it has been suggested, will bring more change than our earlier discovery of electricity. The change may be so massive that some believe all life on our planet will be affected in some way, good or bad. Technologists refer to this change or revolution as machine learning or, more recently, as the dawn of artificial intelligence. While it remains to be seen how intelligent we can make machines, one thing is for sure: this new wave of technology is now everywhere. Developers all over the globe are struggling to make sense of this field and keep up with the changes, as well as trying to benefit from new tools and techniques. Fortunately, companies like Google have realized the difficulty and expense of crafting this powerful new AI and are now commercializing AI services on the cloud. The goal of this book is to guide the reader through the use of these new and growing AI-powered cloud services from Google.
Note
A growing list of other cloud AI providers competes with Google, including Microsoft, Amazon, IBM, and many others.
In this chapter we introduce a number of base concepts from machine learning and data science, as well as the field of deep learning. The following is a list of topics we will cover in this chapter:

What Is Data Science?

Classification and Regression

Data Discovery and Preparation

The Basics of Deep Learning

Understanding How Networks Learn

Building a Deep Learner
What Is Data Science?
Data science is the practice of applying statistical methods to data in order to ascertain some further characteristics about said data. This could be for the purpose of predicting some future event or classifying some observation of an event. Anyone who has ever checked the weather for the next day has used data science in the form of predicting the weather. In fact, humans have been intuitively practicing data science for thousands of years, and it all started when we learned to predict the weather for tomorrow given the weather from the past.
While we may have been practicing data science in many forms for thousands of years, from weather prediction to engineering, it wasn’t until quite recently that the actual field of data science became well known and coveted. This was due primarily to the big data revolution, which began about 10 years ago. This spawned a broader outlook on computer-aided learning about data, which collectively became known as machine learning.
Since machine learning originates from the application of data science, it only makes sense that the two would share a common vocabulary and methodology. As such, we often recommend that anyone seriously interested in developing advanced AI tech like deep learning learn some data science. This will help you not only better grasp the terminology, but also understand the origin or purpose of many techniques. We will address the primary topics in this book, but we suggest that you learn more about data science on your own.
Tip
There are plenty of free courses available online. Just use your favorite search engine and search for “free data science course.”
Now that we understand what data science is and how it relates to machine learning and deep learning, we will move on to looking at how we make sense of data.
Classification and Regression
Data science has developed many ways of exploring and making sense of data, and we often refer to this whole practice as learning. The greater area of machine learning encompasses all forms of learning, including deep learning; reinforcement learning; and unsupervised, semi-supervised, and supervised learning, to name just a few. Figure 1-1 shows an overview of the various forms of learning and how they relate to one another.
As you can see from Figure 1-1, there is a diverse set of learning methodologies (the unshaded boxes) that encompass machine learning as a whole. Within each learning branch we have also identified the key problems or tasks this learning attempts to tackle, shown with rectangles. Each of these subproblems or tasks spawns numerous additional applications. We use the term adversarial for both semi-supervised and unsupervised learning to denote the class of algorithms that self-learn by training against themselves or other similarly matched algorithms. The most famous form of adversarial learner is the GAN, or generative adversarial network. We won’t have much time to go into detail about the methods of unsupervised, semi-supervised, or reinforcement learning in this book. However, after gaining the knowledge in this book, and this chapter in particular, you may want to explore those forms on your own later.
At the middle right of Figure 1-1 is the area of supervised learning and its various branches. This is what we will focus on in this text, particularly the areas of regression and classification. Supervised learning is so named because it requires that the data first be labeled before being fed into the learning algorithm. An example is a dataset showing the amount of accumulated rainfall in millimeters (about 25 millimeters = 1 inch) over the course of 12 months, shown in Table 1-1.
Month  Min rainfall  Max rainfall  Total rainfall
1      22            30            24
2      22            25            48
3      25            27            75
4      49            54            128
5      8             8             136
6      29            47            168
7      40            41            209
8      35            63            263
9      14            25            277
10     45            57            333
11     20            39            364
12     39            51            404
The data shows monthly precipitation values from fictional ground stations in a mythical country or location. To keep things simple, we are going to contrive our data for the first few examples. Over the course of the book, though, we will look at plenty of real datasets. As we can see, the data is labeled with a number of attributes: month, minimum rainfall, maximum rainfall, and total accumulated rainfall. This will work as an excellent example of labeled data, which we can use to perform supervised learning of regression and classification later in this chapter. Before that, let us take a close look at regression in the next section.
Regression
Regression is the process of finding the relationship between dependent variables and independent variables. The most common form of regression is linear regression, so named because it assumes a linear relationship between variables. Figure 1-2 is an example of drawing a line of regression through the set of weather data previously shown in Table 1-1. The plot shows the independent variable, month, against the dependent variable in the last column, total rainfall. For this simple example, we use only two variables. The plot was generated with Google Sheets, which provides linear regression as a data analysis tool out of the box, using the Trendline option under Customize Series.
Note
Independent variables in an equation are separate and outside the influence of other variables. The variable x in the equation $y=mx+b$ is the independent variable. The dependent variable is y because it is derived from x.
Placing a trendline is the same as performing regression against our plotted data. In this case we plotted the month number against the accumulated total rainfall for the year. Using the total accumulation of rain rather than an actual amount per month likewise simplifies our placement of a trendline. The regression we are using in the example is linear, and Google Sheets also allows us to derive and show the equation of the line. Check the chart’s legend and note the equation is in the form $y=mx+b$, or in other words, linear. You will also notice in the legend another value called $R^2$, or what we call R squared. R squared is used as a measure of goodness of fit (i.e., how well the predicted values from regression match the actual data), and because the value ranges to a maximum of 1.0, it often provides a good baseline measure. However, R squared is not our preferred method for determining goodness of fit, and we will talk about better methods in the next section.
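To make the trendline concrete, the following is a minimal sketch of an ordinary least-squares fit in plain Python, applied to the month and total rainfall columns of Table 1-1. The function name `fit_line` and the variable names are our own, chosen for illustration, and we are assuming that the spreadsheet trendline is a standard least-squares line.

```python
# Month number (independent variable) and accumulated rainfall (dependent variable)
months = list(range(1, 13))
totals = [24, 48, 75, 128, 136, 168, 209, 263, 277, 333, 364, 404]

def fit_line(xs, ys):
    """Return slope m, intercept b, and R squared for y = m*x + b (least squares)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    # R squared = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return m, b, r_squared

m, b, r_squared = fit_line(months, totals)
print(f"y = {m:.2f}x + {b:.2f}, R squared = {r_squared:.3f}")
```

Running the sketch prints the same kind of equation and R squared value that appear in a chart legend, which makes it easier to see where those numbers come from.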
Goodness of Fit
The primary problem with R squared is that it actually does not measure goodness of fit. What we find is that the more varied the data is, the larger the standard deviation is, and the lower the values of R squared are. Thus, R squared generally indicates lower values over more diverse and larger datasets, and this makes it useless in deep learning. Instead we apply an error function against our predicted and actual values, taking the difference, squaring it, and then averaging the results. The result of this is known as the average or mean squared error (MSE). Figure 1-3 shows how we would calculate the MSE from our last example. The equation inside the diagram essentially means we take the value predicted by regression and subtract it from the actual value. We square that difference to make it positive and then sum all of those squared values. After that, we divide by our total number of samples to get an average amount of error.
MSE is a relative measure of error and is often specific to your dataset. While MSE does not give us a general quality of fit like R squared, it does give us a relative indication of goodness of fit. This means that lower values of MSE indicate a better goodness of fit. There are other similar measures we can use to determine how well a regression model fits the data. These include root mean squared error (RMSE), which is just the square root of MSE, and mean absolute error (MAE), which averages the absolute differences between the predicted and actual values. Determining goodness of fit will ultimately determine the quality of our models and is something we will revisit often throughout the rest of the book.
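The three error measures described above can be sketched in a few lines of plain Python. The function names and the sample values are hypothetical, chosen only to keep the arithmetic easy to check by hand.

```python
import math

def mse(actual, predicted):
    """Mean squared error: average of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(actual, predicted))

def mae(actual, predicted):
    """Mean absolute error: average of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical values, not taken from the rainfall tables
actual = [2.0, 4.0]
predicted = [4.0, 8.0]

print(mse(actual, predicted))   # (2**2 + 4**2) / 2 = 10.0
print(rmse(actual, predicted))  # square root of 10.0
print(mae(actual, predicted))   # (2 + 4) / 2 = 3.0
```

Because RMSE and MAE are expressed in the same units as the data, they are often easier to interpret than MSE when comparing two models on the same dataset.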
In the next section we look at a different form of regression: logistic regression, or what we commonly refer to as classification.
Classification with Logistic Regression
Aside from regression, the next common problem we will look to solve is classifying data into discrete classes. This process is known as classification, but in data science we refer to it as logistic regression. Logistic means logit or binary, which makes this a binary form of regression. In fact, we often refer to this form of regression as regression with classes or binary regression, so named because the regression model does not predict specific values, but rather a class boundary. You can think of this as the equation of regression being the line that separates the classes. An example of how this works is shown in Figure 1-4.
In Figure 1-4, we see our example rainfall data again, but this time plotted on month and maximum rainfall for a different year. Now the purpose of the plot is to classify the months as rainy (wet) or dry. The equation of regression in the diagram denotes the class break between the two classes of months, wet or dry. With classification problems, our measure of goodness of fit now becomes how well the model predicts an item is in the specific class. Goodness of fit for classification problems uses a measure of accuracy, or what we denote as ACC, with a value from 0 to 1.0, or 0% to 100%, to denote the certainty/accuracy of data being within the class.
Tip
The source for Figure 1-4 is a free data science learning site called Desmos. Desmos is a great site where you can visualize many different machine learning algorithms. It is also highly recommended for anyone wanting to learn the fundamentals of data science and machine learning.
Referring back to Figure 1-4, it is worth mentioning that the logistic regression used here is a self-supervised method. That means we didn’t have to label the data to derive the equation, but we can use supervised learning or labeled data to train classes as well. Table 1-2 shows a sample rainfall dataset with classes defined. A class of 1 indicates a wet month, while a class of 0 denotes a dry month.
Month  Dry/Wet (0 or 1)
1      1
2      0
3      0
4      1
5      0
6      1
7      0
8      1
9      0
10     1
11     0
12     1
It is easy to see from Table 1-2 which months break into which classes, wet or dry. However, it is important to note how we define classes. Using a 0 or 1 to denote whether a data item is within a class or not will become a common technique we use later in many classification problems. Since we use accuracy to measure fit with classification, it also makes this type of model more intuitive to train. If your background is programming, though, you may realize that you could also classify our sample data far more easily with a simple if statement. While that is true for these simple examples of single-variable regression or classification, it is far from the case when we tackle problems with multiple input variables. We will cover multivariable learning in the next section.
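The simple if statement mentioned above might look like the following sketch. It pairs the maximum rainfall column from Table 1-1 with the class labels from Table 1-2, which is an assumption made purely for illustration, since Figure 1-4 plots a different year. The 45-millimeter threshold and the function names are likewise hypothetical.

```python
# Max rainfall per month (Table 1-1) and wet/dry labels (Table 1-2).
# Pairing these two tables is an assumption made for illustration only.
max_rainfall = [30, 25, 27, 54, 8, 47, 41, 63, 25, 57, 39, 51]
labels = [1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

WET_THRESHOLD_MM = 45  # hypothetical class boundary

def classify(rainfall_mm):
    """Return 1 (wet) if rainfall meets the threshold, else 0 (dry)."""
    if rainfall_mm >= WET_THRESHOLD_MM:
        return 1
    return 0

def accuracy(predictions, targets):
    """Fraction of predictions that match the labeled class."""
    correct = sum(1 for p, t in zip(predictions, targets) if p == t)
    return correct / len(targets)

predictions = [classify(r) for r in max_rainfall]
acc = accuracy(predictions, labels)
print(f"ACC = {acc:.2f}")
```

With this particular pairing the rule misclassifies only month 1, a reminder that even a hand-picked threshold is a model whose accuracy has to be measured against labeled data.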
Multivariate Regression and Classification
The example problem we just looked at was intended to be kept simple in order to convey the key concepts. In the real world, however, data science and machine learning are far from simple and often need to tackle far more complex data. In many cases, data scientists look at numerous independent variables, or what are referred to as features. A single feature denotes a single independent variable we would use to describe our data. With the previous example, we looked at only one independent variable, the month number for both problems of regression and classification. This allowed us to derive a relationship between that month number (a feature) and a dependent variable. For regression, we used total monthly rainfall to determine the linear relationship. Then for classification we used maximum monthly rainfall to determine the month’s class, wet or dry. However, in the real world we often need to consider multiple features that need to be reduced to a single value using regression or classification.
Note
The data science algorithms we look at here for performing regression and classification were selected because they lead into the deep learning analogs we will look at later. There are numerous other data science methods that perform the same tasks using statistical methods that we will not spend time on in this book. Interested readers may wish to explore a course, book, or video on data science later.
In the real world, data scientists will often deal with datasets that have dozens, hundreds, or thousands of features. Dealing with this massive amount of data requires more complex algorithms, but the concepts for regression and classification are still the same. Therefore, we won’t have a need to explore finer details of using these more complex classic statistical methods. As it turns out, deep learning is especially well suited to learning data with multiple features. However, it is still important for us to understand various tips and tricks for exploring and preparing data for learning, as discussed in the next section.
Data Discovery and Preparation
Machine learning, data science, and deep learning models are often very much dependent on the data we have available to train or solve problems. Data itself can represent everything from tabular data to pictures, images, videos, document text, spoken text, and computer interfaces. With so much diversity of data, it can be difficult to establish well-defined, cross-cutting rules that we can use for all datasets, but in this section we look at a few important considerations you should remember when handling data for machine learning.
One of the major hurdles data scientists and machine learners face is finding good-quality data. Of course, there are plenty of nice, free sample datasets to play with for learning, but when it comes to the real world, we often need to prepare our own data. It is therefore critical to understand what makes data good or bad.
Bad Data
One characteristic of bad data is that it is duplicated, incomplete, or sparse, meaning it may have multiple duplicated values or it may be missing values for some or many features. Table 1-3 shows an example of our previous mythical rainfall data with incomplete or bad data.
Month  Min rainfall  Max rainfall  Total rainfall
1      22            30            24
2      22            25            48
3      25            (missing)     (missing)
4      49            54            128
5      8             8             136
6      (missing)     47            168
7      40            41            209
8      35            (missing)     (missing)
9      14            (missing)     277
10     45            57            333
11     (missing)     (missing)     (missing)
12     39            51            404
Now, if we wanted to perform linear regression on the same dataset, we would come across some issues, the primary one being the missing values on the labeled dependent variable total rainfall. We could try to replace the missing values with 0, but that would just skew our data. Instead, we can just omit the data items with bad data. This reduces the previous dataset to the new values shown in Table 1-4.
Month  Min rainfall  Max rainfall  Total rainfall
1      22            30            24
2      22            25            48
4      49            54            128
5      8             8             136
…      …             …             …
7      40            41            209
…      …             …             …
10     45            57            333
12     39            51            404
Plotting this data in Google Sheets and applying a trendline produces Figure 1-5. As you can see in the figure, the missing values are not much of an issue. You can clearly see now how removing the null values also shows us how well regression performs. Pay special attention to where the missing months should be, and look at how well the trendline or regression equation is predicting these values. In fact, data scientists will often remove not only bad data with missing null values, but also good data. The reason they remove good data is to validate their answer. For instance, we can go back to our full sample data as shown in Table 1-1 and use some of those values to validate our regression. Take month 3, where the accumulated value is 75. If we look at the trendline in Figure 1-5, we can see that the predicted value for month 3 is around 75. This practice of removing a small set of data for testing and validating your answers is fundamental to data science, and is something we will cover in the next section.
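The same clean-then-validate idea can be sketched in plain Python. The listing below assumes the missing cells fall where a comparison with Table 1-1 suggests, marks them with None, drops every incomplete record, fits a least-squares line to what remains, and then checks the prediction for month 3 against the held-out value of 75. The helper `fit_line` is our own illustration, not a library call.

```python
# Incomplete rainfall data as (month, min, max, total); None marks a missing cell.
# The position of each missing cell is inferred by comparing with Table 1-1.
rows = [
    (1, 22, 30, 24), (2, 22, 25, 48), (3, 25, None, None),
    (4, 49, 54, 128), (5, 8, 8, 136), (6, None, 47, 168),
    (7, 40, 41, 209), (8, 35, None, None), (9, 14, None, 277),
    (10, 45, 57, 333), (11, None, None, None), (12, 39, 51, 404),
]

# Keep only complete records
clean = [row for row in rows if None not in row]

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x

months = [row[0] for row in clean]
totals = [row[3] for row in clean]
m, b = fit_line(months, totals)

# Validate against month 3, whose true total (75) was not part of the fit
predicted = m * 3 + b
print(f"month 3: predicted {predicted:.1f}, actual 75")
```

The exact prediction depends on which records survive the cleaning step, so a stricter or looser rule for dropping rows will shift the line slightly.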
Training, Test, and Validation Data
A fundamental concept in data science is breaking source data into three categories: training, test, and validation. We set aside the bulk of the data, often about 80%, for training. Then we break the remaining data down into 15% test and 5% validation. You may initially think this could compromise your experiment, but as we saw, removing small amounts of data increased the confidence in our model. Since model confidence is a key criterion for any successful machine learning model, removing a small percentage of data for testing and validation is seen as trivial. As we will see, setting aside data for testing and validation will be critical for evaluating our performance and baselining our models.
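A minimal sketch of the 80/15/5 split described above follows. The function name `split_data`, the fixed seed, and the stand-in records are assumptions for illustration; real projects often rely on a library helper instead.

```python
import random

def split_data(records, train=0.80, test=0.15, seed=42):
    """Shuffle records and split them into training, test, and validation lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    training = shuffled[:n_train]
    testing = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # whatever remains (about 5%)
    return training, testing, validation

records = list(range(100))  # stand-in for 100 rows of data
training, testing, validation = split_data(records)
print(len(training), len(testing), len(validation))  # 80 15 5
```

Shuffling before slicing matters when the source data is ordered, as our rainfall months are; otherwise the test and validation sets would contain only the final rows.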
Tip
Another critical purpose for breaking out data into test and validation is to confirm that the model is not over- or underfitting. We will cover the concepts of over- and underfitting when we get to deep learning later in this chapter.
Good Data
Aside from the obvious missing, duplicate, or null features, whether data counts as good depends on the machine learning technique. With classical data science methods, we almost always want to verify the dependency between variables. We do this to make sure that two variables are not strongly dependent on one another. For instance, in our previous rainfall example, the total accumulated rainfall per month would be heavily dependent on the maximum monthly rainfall. Therefore, most classic data science methods would discourage using both variables since they heavily depend on each other. Instead, those methods strongly encourage variable independence, but again, this is often not the ideal when it comes to the real world. This is also where we see the true benefit of deep learning methods. Deep learning has the ability to work with independent, dependent, and sparse or missing data as well as, if not better than, any other statistical method we have at our disposal. However, there are still some common rules we can use to prepare data for all types of learning, as discussed in the next section.
Preparing Data
The type of preparation you need to perform on your data is quite dependent on the machine learning algorithm being employed and the type of data itself. Classical statistics–based methods, like the ones we used for regression and classification earlier, often require more data preparation. As we saw, you need to be careful if the data has null, duplicate, or missing values, and in most cases you either eliminate those records or annotate them in some manner. For example, in our previous rainfall example, we could have used a number of methods to fill in those missing data values. However, when it comes to deep learning, as we will see shortly, we often throw everything at the learning network. In fact, deep learning often uses data sparsity to its advantage, and this strongly goes against most classic data science. If anything, deep learning suffers more from data that is too similar, and duplicated data can be especially problematic. Therefore, when preparing data for deep learning we want to consider some basic rules that are outside the norm for data science:
 Remove duplicated data

Duplicated data is often an issue for deep learning, and for data science in general. Duplicates provide extra emphasis to the duplicated rows. The one exception to this will be time-based data, or where duplicate values have meaning.
 Maintain data sparsity

Avoid the temptation to fill in data gaps or remove data records due to missing values. Deep learning networks generalize data better when fed sparse data or when the network itself is made sparse. Making a network layer sparse is called dropout, which is a concept we will cover in later chapters.
 Keep dependent variables

Data scientists will often reduce the number of features in large datasets by removing highly dependent features. We saw this in our rainfall example where the total rainfall in a month was highly dependent on the maximum rainfall in that month. A data scientist would want to remove the dependent feature, whereas a deep learner would likely keep it in. The reason for this is that, while the feature is observed to be highly dependent, it may still have some independent effect.
 Increase data variability

In most data science problems, we often want to constrain data variability in some manner. Reducing data variation allows a model to train quicker and results in a better answer. However, the opposite is often the case with deep learning, where we often want to expose the model to the biggest variation to encourage better generalization and avoid false positives. We will explore why this can be an issue in later chapters.
 Normalize the data

Normalizing the data is something we will cover in more detail as we go through the various examples. We do this in order to make features unitless and typically range in value from –1 to +1. In some cases you may normalize data to 0 to 1. In any case, we will cover normalization later when it pertains to the relevant sample.
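Two of the rules above, removing duplicates and normalizing, are easy to sketch in plain Python. The helper names and the sample rows are hypothetical; the normalization shown is simple min-max scaling, which is one common choice among several.

```python
def remove_duplicates(rows):
    """Drop repeated rows while keeping the first occurrence and the order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

def normalize(values, low=-1.0, high=1.0):
    """Min-max scale values into the range [low, high]."""
    smallest, largest = min(values), max(values)
    span = largest - smallest
    if span == 0:
        # All values identical: map everything to the middle of the range
        return [(low + high) / 2 for _ in values]
    return [low + (v - smallest) / span * (high - low) for v in values]

rows = [(1, 22, 30), (2, 22, 25), (2, 22, 25), (3, 25, 27)]  # one duplicated row
print(remove_duplicates(rows))

print(normalize([0, 5, 10]))            # [-1.0, 0.0, 1.0]
print(normalize([0, 5, 10], 0.0, 1.0))  # [0.0, 0.5, 1.0]
```

Note that the scaled values no longer carry units such as millimeters, which is what allows features measured on very different scales to be fed to the same network.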
Aside from applying these general rules to your data, you will also want to understand what it is you want from your data and what your expectations are. We will cover this in more detail in the next section.
Questioning Your Data
One key observation we need to make with any dataset before applying a data science or machine learning algorithm is determining how likely the expected answer is. For example, if you are training an algorithm to guess the next roll on a six-sided die, you know that an algorithm should guess correctly one out of six times, or 1/6, at a minimum. If your trained algorithm guessed the correct answer 1 out of 10 times on a six-sided die, this would indicate very poor performance, since even a random guess is likely to be correct 1 out of 6 times. Aside from understanding the baseline expectation, here are some other helpful questions/rules, again skewed more toward deep learning:
 Evaluate baseline expectation

Determine how likely a random guess is to get the correct answer.
 Evaluate maximum expectation

How likely is your model to get the best answer? Are you constraining your search to a very small space, so small that even finding it could be problematic? For example, assume we want to train a network to recognize cats. We feed it 1 picture of a cat and 10,000 pictures of dogs, which we train the network to recognize. In that case, our algorithm would have to correctly identify 1 cat out of 10,001 pictures. However, with deep learning, since our network was trained on only one cat picture, it will recognize only that one cat, more or less exactly. The takeaway here is to make sure the data covers as much variety as possible: the more, the better.
 Evaluate least expectation

Conversely, how likely is your algorithm to get the wrong answer? In other words, is the random guess or base expectation very high to start? If the base expectation is above 50%, then you should reconsider your problem in most cases.
 Annotate the data

Are you able to annotate or add to the data in some manner? For instance, if your dataset consists of dog pictures, what if you horizontally flipped all pictures and added those? This would in essence double the size of your dataset and increase its variability. Flipping images and other methods will be explored later in relevant exercises.
Make sure to always review the first three rules from this list. It is important to understand that the questions have answers in your data, and that the answers are obtainable. However, the opposite is also very true, and you need to make sure that the answer is not so obvious. Conversely, unsupervised and semi-supervised learning methods are designed to find answers from the data on their own. In any case, when performing regression or classification with supervised learning, you will always want to evaluate the expectations from your data.
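The baseline checks in this list can be estimated with a short simulation. The sketch below measures how often a random guess matches a six-sided die and how accurate it would be to always answer with the most common label in the cat and dog example. The function names, trial count, and seed are our own choices.

```python
import random

def random_guess_accuracy(sides=6, trials=60_000, seed=1):
    """Estimate how often a uniformly random guess matches a die roll."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        roll = rng.randint(1, sides)
        guess = rng.randint(1, sides)
        if roll == guess:
            hits += 1
    return hits / trials

def majority_class_baseline(labels):
    """Accuracy of always predicting the most common label."""
    most_common = max(set(labels), key=labels.count)
    return labels.count(most_common) / len(labels)

baseline = random_guess_accuracy()
print(f"random guess on a six-sided die: {baseline:.3f} (expected about {1/6:.3f})")

# The cat example from the list: 1 cat picture among 10,000 dog pictures
labels = ["cat"] + ["dog"] * 10_000
print(f"always guessing 'dog': {majority_class_baseline(labels):.4f}")
```

A trained model is only interesting if it beats these numbers. In the cat and dog case the do-nothing baseline is already far above 50%, which is exactly the situation where the least-expectation rule suggests reconsidering the problem or the data.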
A common practice now is to construct unsupervised and semi-supervised deep learning networks to extract the relevant features from the data, and then train on those new features. These networks are able to learn, on their own, what features have relevancy. This practice is known as autoencoding, and is one of the first types of networks we will learn later in this chapter.
The Basics of Deep Learning
Deep learning and the concept of connected learning systems that function similarly to a biological brain have been around since the 1950s. While deep learning is inspired by biology, it in no way attempts to model a real biological neuron. In fact, we still understand very little of how we learn and strengthen the connections in any brain; however, we will need to understand in great detail how the connections strengthen or weaken—how they learn—in the deep learning neural networks we build.
Tip
The Deep Learning Revolution by Terrence J. Sejnowski (MIT Press) is a fantastic book on the history and revolution of deep learning. Sejnowski is considered a founding father of deep learning, which makes his tales about its history more entertaining.
In early 2020, state-of-the-art deep learning systems can encompass millions of connections. Understanding how to train such megalithic systems is outside the scope of this book, but using such systems is not. Google and others now provide access to such powerful deep learning systems through a cloud interface. These cloud interfaces/services are simple to use, as we will see in later chapters. However, understanding the internal workings of a deep learning system will make it easier to identify when things go wrong and how to fix them. In addition, understanding the simplicity of these systems will likely take away any apprehension or intimidation you feel about deep learning. Therefore, we will start with the heart of the deep learning network, the perceptron.
The perceptron is central to a deep learning system. You can think of the perceptron as being analogous to the engine in a car, except in a car there is a single engine, while in a deep learning system there may be thousands of perceptrons all connected in layers. Figure 1-6 shows a single perceptron with a number of input connections and a single output controlled by an activation function.
We can picture all activity flowing through Figure 1-6 from the left to the right. Starting at the far left, the inputs are labeled $X_1$ to $X_3$ to show three inputs. In a real network the number of inputs could be in the thousands or millions. Moving from the inputs, we then multiply each input by a weight, denoted $W_1$ to $W_4$. The weights represent how strong the connection is; thus a higher-value weight will have a stronger connection. Notice that the first weight, $W_1$, is multiplied by a constant input of one; this is the bias. After all the weights are multiplied by the inputs, they are summed at the next step, denoted by the Greek symbol ∑ for summation. Finally, the total sum is pushed through an activation function and the result is output, which is dependent on the activation function. It may help to think that the activation function controls how the perceptron fires and passes its output along. This whole process is called a forward pass through the network; it is also called inference, or how the network answers.
Generally, we will try to minimize the use of math to explain concepts in this book. However, math is a core element of this technology, and it is sometimes easier and more relevant to express concepts in terms of math equations. Therefore, we will start by showing how a perceptron fires mathematically, as shown in Equation 1-1.
Equation 1-1.
$$y = W_1 + \sum_{i=1}^{n} x_i \times W_{i+1}$$
Where:

y = output sent to activation function

W = a weight

x = an input
Equation 1-1 shows the summation part of the forward pass through a single perceptron. This is just where the weights are multiplied by the inputs and everything is added up. It can also be helpful to view how this looks in code. Example 1-1 shows a function written in Python that performs the summation step in a perceptron.
Example 1-1.
def summation(inputs, weights):
    # weights[0] is the bias weight; weights[i + 1] pairs with inputs[i]
    total = weights[0]
    for i in range(len(inputs)):
        total += weights[i + 1] * inputs[i]
    return total
After summation the result is passed into an activation function. Activation functions are critical to deep learning, and these functions control the perceptron’s output. In the single perceptron example, an activation function is less critical, and in this simple example we will just use a linear function, shown in Example 1-2. This is the simplest function, as it just returns a straight mapping of the result to the output.
Example 1-2.
def act_linear(total):
    # Linear activation: pass the summed value through unchanged
    return total
Example 1-3 shows a step activation function, so named because the output steps to a value when the threshold is reached. In the listing, the threshold is >= 0.0 and the stepped output is 1.0. Thus, when a summed output is greater than or equal to zero, the perceptron outputs 1.0.
Example 1-3.
def act_step(total):
    # Step activation: output 1.0 at or above the threshold, otherwise 0.0
    return 1.0 if total >= 0.0 else 0.0
Note
The code examples here are meant for demonstration only. While the code is syntactically correct and will run, don’t expect much from the output. This is because the network weights still need to learn. We will cover this later in this chapter.
Finally, we can put all of this code together in Example 1-4, where we have written a forward_pass function that combines summation and the earlier linear activation function.
Example 1-4.
def forward_pass(inputs, weights):
    # Sum the weighted inputs, then apply the linear activation
    return act_linear(summation(inputs, weights))

print(forward_pass([2, 3, 4], [2, 3, 4, 5]))
Can you predict the output of Example 1-4 and previous related listings? Try to predict the outcome without typing the code into a Python interpreter and running it. We will leave it as an exercise for the reader to find the answer on their own. While the code in the previous example may seem simple, there are a number of subtle nuances that often trip up newcomers. Therefore, we will reinforce the concept of the perceptron further in the next section by playing a game.
The Perceptron Game
Games and puzzles can be a fun, engaging, and powerful way to teach abstract concepts. The Perceptron Game was born out of the frustration of teaching students the previous coding example and realizing that 90% of the class often still missed major concepts. Of course, many other deep learners, including the godfather himself, Dr. Geoff Hinton, have been said to use variations of a similar game. This version can be played as a solitaire puzzle or as a group collaboration, depending on how many friends you want to play with. One thing to keep in mind before inviting the family over is that this game is still heavily math-focused and may not be for everyone.
Note
You can find all of the printable materials for the game in the book’s source code download for Chapter 1.
The play area for the Perceptron Game is a perceptron, or in this case a printed mat like that shown in Figure 1-7. This is the same figure we saw previously, but this time it is annotated with some extra pieces. Aside from printing out the play area, the perceptron mat, you will need to find about eight six-sided dice. You can use fewer dice, but the more, the better. We will use the dice as numeric placeholders. For the most part, the number on each die face represents its respective value, except for 6, which stands for 0.
Thus, the value for each die face is:

1 = 1

2 = 2

3 = 3

4 = 4

5 = 5

6 = 0
Given the die positions on the mat shown in Figure 1-7, we can see there are two inputs represented by 2 and 3. Inside the perceptron we have weights set to 0, 1, and 4 (remember that 6 = 0). Based on these inputs and weights, we can calculate the total summation as follows:

bias = 1 × 0 = 0

input 2 = 2 × 1 = 2

input 3 = 3 × 4 = 12
Total sum = 0 + 2 + 12 = 14
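The same dice arithmetic can be sketched in a few lines of Python. This is only an illustration of the calculation above: the helper names die_value and perceptron_sum are made up for this sketch and are not part of the book's source code, and we assume the first weight die is the bias weight, whose input is always 1.

```python
# Hypothetical helpers for the dice mat; names are illustrative only.

def die_value(face):
    """A die face stands for its own number, except 6, which stands for 0."""
    return 0 if face == 6 else face

def perceptron_sum(input_dice, weight_dice):
    """Sum the bias weight plus each input multiplied by its weight.

    weight_dice[0] is assumed to be the bias weight (bias input is 1).
    """
    inputs = [die_value(d) for d in input_dice]
    weights = [die_value(d) for d in weight_dice]
    total = 1 * weights[0]
    for x, w in zip(inputs, weights[1:]):
        total += x * w
    return total

# Inputs 2 and 3; weight dice showing 6 (= 0), 1, and 4.
total = perceptron_sum(input_dice=[2, 3], weight_dice=[6, 1, 4])
print(total)  # 14
```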
The total sum now needs to be output through an activation function. For simplicity, we will say that our current perceptron does not use an activation function. This means that all of the outputs will be linear or raw values. So 14 becomes the output value for the perceptron. However, assume that the real answer we want, the labeled answer, is 10. That means the perceptron has to learn the weights through micro-adjustments to provide the right output. Fortunately, there is a relatively simple equation, Equation 1-2, that can do that.
Equation 1-2.
$$W_i = W_i + \alpha (L - O)$$
Where:

L = the labeled value

O = the output from the perceptron

W = the weight to be adjusted

α = training constant
Equation 1-2 adjusts each weight by an amount controlled by alpha (α) and by the difference between the labeled value and the value predicted by the perceptron’s forward pass. Going back to our last example, we can correct one of the sample weights by substituting values into Equation 1-2. Assuming a value of 0.1 for α and a weight of 4 (from the example above), we get Equation 1-3.
Equation 1-3.
$$3.6 = 4 + 0.1(10 - 14)$$
Thus, from Equation 1-3 we can see the new value for the weight would be 3.6. Now, if we put those values back into the equation, the new output for the perceptron would be 12.8. However, the right answer is still 10. This is okay because we don’t want to adjust a single weight too quickly. Remember that this is only one input, and we may need to adjust for thousands or millions of inputs, which is why we only set α to a small value. By using only a small value, we can then incrementally go through the inputs over and over again until the perceptron weights learn. Going back to the previous example with actual answer 10 and output 14, we can perform weight updates iteratively, as shown in Table 1-5.
Table 1-5.
X1  X2  W1     W2     Label  Output  Error
2   3   1      4      10     14      4
2   3   0.6    3.6    10     12      2
2   3   0.4    3.4    10     11      1
2   3   0.3    3.3    10     10.5    0.5
2   3   0.25   3.25   10     10.25   0.25
2   3   0.225  3.225  10     10.125  0.125
…
By iteratively adjusting weights, we can see how the perceptron converges to an answer for a single set of inputs. We of course want to look at far more complex problems, hence the reason for the game.
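The iteration can be sketched as a short loop. This is a minimal illustration rather than a general training routine: the function name train_single_sample is hypothetical, both weights receive the same α(L − O) adjustment, and we assume the bias weight stays at 0, so it is left out.

```python
def train_single_sample(inputs, weights, label, alpha=0.1, steps=6):
    """Repeatedly apply W = W + alpha * (L - O) for one set of inputs.

    Returns one (weights, output, error) tuple per step, where error is
    reported as output - label.
    """
    history = []
    for _ in range(steps):
        output = sum(x * w for x, w in zip(inputs, weights))
        history.append((list(weights), output, output - label))
        # Every weight gets the same small correction.
        weights = [w + alpha * (label - output) for w in weights]
    return history

history = train_single_sample([2, 3], [1.0, 4.0], label=10)
for weights, output, error in history:
    print([round(w, 3) for w in weights], round(output, 3), round(error, 3))
```

Each printed row shows the weights going into a step, the perceptron output, and the remaining error, which halves on every pass for this particular input.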
Note
Before you throw this book down and yell, “Eureka! I know how neural networks learn,” wait and take a step back. Real networks use a far more complex method called backpropagation, which is coming up in this chapter.
The goal of the Perceptron Game is to find the weights that will solve for the correct outputs. In Table 1-6, there is a list of single inputs and the expected outputs. What you need to do is find the weights (weight 1 for the bias and weight 2 for the input) that will let the perceptron predict the correct output.
Table 1-6.
X1  Expected output
4   14
3   11
2   8
1   5
…
Now you have a number of options to use to learn or set the weights. You can:

Guess: As a human you may be able to intuitively figure the answer out in your head. Try to guess what the weights are first.

Use the random method: Use the dice and roll random values. Then try those random values and see if those work. As a hint, the bias (weight 1) and input 1 (weight 2) weights are not the same value and are not zero (6 on a die).

Use Equation 1-2: Use the equation we looked at earlier to solve for the weights. If you get stuck, this may be a good method to fall back on.

Use programming: We will frown upon programming as a solution in this chapter, but only in this chapter. Leave it for later.
Tip
Even if you guess the answer quickly, try using the random method as well. Understanding how different methods solve for the weights is the point of this exercise.
The answer to this problem (and the others) is provided at the end of the chapter. We didn’t want readers to spot the answers while doing the problem. When you are done, check your answer at the back of the chapter and then move on to the next perceptron puzzles in Tables 1-7 and 1-8.
Table 1-7.
X1  X2  Expected output
4   2   8
3   1   5
2   0   2
1   3   7
0   4   8
5   5   15
Table 1-8.
X1  X2  X3  Expected output
4   2   1   8
3   1   0   5
2   0   2   2
1   3   3   7
0   4   4   8
5   5   5   15
There can be multiple answers to these games depending on how you solve them. We arrived at the answers at the end of the chapter by guessing, and yours may differ if you used Equation 1-2, for instance. Either way, if your perceptron is able to regress the right output and you understand how this is done, you are well on your way.
With regression under our belt, it is time to move on to classification. Now we are interested in classifying something as either in a class or not; that is, a day is wet or dry, cold or hot, cloudy or sunny. However, to do this correctly, we have to step our output through an activation function. Using an activation function, particularly the step function, will allow us to better classify our output. Refer back to Example 1-3 to review the step function, but essentially, if the summed value is less than zero, 0.0 is output; otherwise 1.0 is output. Now, if we consider the game in Table 1-9, the output is shown as a class 0 or 1.
Table 1-9.
X1  Expected output
4   0
3   0
2   1
1   1
…
Programmatically, you could likely solve Game 4 in seconds, but what weights would a single perceptron need to properly classify those outputs? Well, the problem is that it can’t be done using our toolset thus far. Go ahead and try Equation 1-2, but you’ll find that it doesn’t work, not for a single perceptron anyway. However, we can solve this by adding a couple more perceptrons, as shown in Figure 1-8. In this figure, we can see three perceptrons connected, two input and one output. We call each set of perceptrons a layer. Therefore, the figure has an input layer with two perceptrons and one output layer with a single perceptron. Inside these perceptrons there are four input weights ([bias + input] × 2) in the input layer and two weights in the output layer. Are you able to balance these weights now to provide the correct output? Give it a try.
Note
In classroom settings, we typically have students form groups and pretend each is a perceptron. They are then told to organize themselves in layers and solve the various weights in the problem.
For the last game (Table 1-10), we want to increase the number of outputs from one class output node to two. This means we also need to put two perceptrons in the output layer.
Table 1-10.
X1  X2  X3  Y1  Y2
4   0   2   0   1
5   1   3   0   1
3   2   4   0   1
2   3   5   1   0
1   4   0   1   0
0   5   1   1   0
Likewise, we are adding another input, and therefore it makes sense to also add another input perceptron. We then end up with the multilayer perceptron network shown in Figure 1-9. The figure shows a network with 3 input perceptrons, each taking 3 inputs, for a total of 12 weights ([3 inputs + bias] × 3). Then in the second (output) layer of 2 perceptrons, we have 8 weights ([3 inputs + bias] × 2), for a total of 20 weights. The game is far simpler than it first appears, and the trick is to follow the zeros.
After you solve each problem, consult the end of the chapter for the answer. You may be able to solve the games completely in your head, but it can also help to physically use a mat and dice to try to solve the games randomly. However, you should also consider at this point how you might apply Equation 1-2, the perceptron learning equation, to a multilayer perceptron. The short answer is that you can’t, and we will look at why this is the case in the next section.
Understanding How Networks Learn
As we’ve seen by playing the Perceptron Game, when we start to combine multiple perceptrons into layers, things get complicated quickly. We call those multilayer perceptron models neural networks or advanced neural networks, or more recently, deep learning systems. Whatever we call them, when we scale from a single perceptron to even just a few, solving for the amount to update a single weight in the entire system becomes very complicated. You likely already realized that when playing the game, but hopefully you also figured out that solving the weights becomes systematic. That is, once you have one weight figured out, you can move backward and solve the rest. This system of solving the weights by moving backward is how we solve the weights in networks today. That system is called backpropagation, and we will delve into greater detail on it next.
Note
As you’ve already seen, there are numerous ways to solve the weights in a network. Randomizing was often a good solution before networks became far too complex. However, the preferred method is now backpropagation, though that may change in the future.
Backpropagation
While Equation 1-2 will work for updating or learning the weights in a single perceptron, it is not able to find the updates across an entire network. In order to do this, we fall back to calculus, which is able to determine how much change or effect each weight has on the network output. By determining this, we can work backward and determine how much each weight in the network needs to be updated or corrected. This system is called backpropagation. The complicated parts come from calculus, but fortunately the whole system can be automated with a technique called automatic differentiation. However, it is still important to intuitively understand how this system works in the event something goes wrong. Problems can and often do happen, and many of them are the result of something called vanishing or exploding gradients. Therefore, to help you understand if you have a vanishing or exploding gradient, we will explore backpropagation in some detail.
In order to determine the amount of change of each weight, we need to know how to calculate the amount of change for the entire system. We can do this by taking the equation that gives us the forward answer, or prediction, and differentiating it with calculus. Recall that calculus gives us the rate of change of an equation or system. For basic calculus with one variable, this is elementary, and Figure 1-10 shows how a function can be differentiated at a single point to find the gradient or change at that point.
Note
If you have a foundational knowledge of calculus, you should be able to understand the upcoming material even if it has been some time since you practiced calculus. However, those readers with no knowledge of calculus should explore that material further on their own. There are plenty of free videos or courses online that can provide this knowledge.
Now that we understand why calculus is essential, we can move on to solving our equations. However, with a multilayer perceptron, each perceptron has its own weights, summation, and activation functions, so differentiating all of this sounds quite complex. The short answer is yes, it very much used to be, but we have found tricks to dramatically simplify the problem. If we consider that each layer of perceptrons uses the same activation function, then we can treat the entire layer as a linear system of equations, thus reducing a single layer down to a single function such as f(). Incidentally, reducing a layer down to a linear system of equations, or in other words a matrix, reduces the computational complexity immensely as well. This is how all deep learning systems work internally today, and this is what makes processing a layer through a network so fast. It is also the reason that deep learning systems now surpass humans in many tasks that we previously thought we would never be surpassed on.
Note
Included in your required math background is linear algebra. Linear algebra helps us solve linear systems of equations and create some cool 3D visuals, like games. It is another mathematical tool that can help you understand deep learning and other systems, likely more so than calculus.
By reducing an entire layer down to a system of equations, we can then assume a single function f(). Each successive function then would apply itself to f, such as g(f()), where the g function is a second or successive layer in a network. Remember that the output from the first layer feeds into the second layer, or function, and so on. We can differentiate this composite function by using the chain rule, as demonstrated in Equation 1-4.
Equation 1-4.
$$h(x) = g(f(x)) \qquad h'(x) = g'(f(x))\,f'(x)$$
In Equation 1-4, we can use the chain rule from calculus, which tells us that any equation in the first form can then be differentiated in the second form. This gives us a method to differentiate each of the layers, and then by using some more math magic, we can derive the set of specific weight update equations shown in Figure 1-11.
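To make the chain rule a little more concrete, here is a small numeric check. The two "layers" f and g below are made-up toy functions chosen only for illustration, not anything from a real network; the sketch compares the chain-rule derivative against a finite-difference estimate.

```python
# Toy "layers" (assumed for illustration): f is linear, g squares its input.
def f(x):
    return 3.0 * x + 1.0

def g(u):
    return u ** 2

def h(x):
    return g(f(x))          # the composed function h(x) = g(f(x))

def df(x):
    return 3.0              # derivative of f

def dg(u):
    return 2.0 * u          # derivative of g

def dh(x):
    return dg(f(x)) * df(x)  # chain rule: g'(f(x)) * f'(x)

def numeric_derivative(fn, x, eps=1e-6):
    """Central finite-difference estimate of the derivative of fn at x."""
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)

print(dh(2.0))                      # chain-rule result
print(numeric_derivative(h, 2.0))   # should be very close to the line above
```

Automatic differentiation libraries apply the same rule layer by layer, just without us having to write the derivative functions by hand.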
Figure 1-11 shows the high-level steps of reducing the perceptron forward function at the top of the figure into a gradient weight function we can use to update weights in training at the bottom. The mathematics show how the forward function is first differentiated with respect to x, the inputs, in the second-to-last equation, while the last equation differentiates the function with respect to the weights (w). By differentiating with respect to w, we can determine the gradient or amount of change each weight contributes to the final answer.
This equation shows the calculus for deriving the gradient of change for each weight. Gradients represent an amount and direction of change. Thus, by finding the gradient, we can understand the amount the individual weight or parameter contributed to the output error. We can then reverse the gradient and adjust the weight in the opposite direction. Keep in mind that each time we change a weight, we want to change the smallest amount possible. This way a change in the weight won’t cause another weight to get unbalanced. This can be quite tricky when we train thousands or millions of weights, so we introduce a learning rate called alpha. Remember that we used alpha in our single perceptron example to set the amount of change or improvement in each iteration, and the same applies here. Except in this case, we need to make alpha a much smaller value, and in most cases the value is 0.001 or less.
Alpha, the learning rate of a network, is a common parameter we will see over and over again, and it is used to tune how fast a network trains. Set the value too low, and the network learns very slowly, but it may avoid certain training pitfalls. Set alpha too high, and the network learns quickly but will likely become unstable. Instead of converging to an answer, it will likely give a wrong answer. These issues occur because the network may get stuck in some local minimum, as shown in Figure 1-12, whereas the goal of any network is to find the global minimum or maximum value.
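A quick way to get a feel for alpha is to run plain gradient descent on a toy loss. The sketch below assumes a made-up one-weight loss, w², whose gradient is 2w; the function name descend and the three alpha values are chosen only for illustration.

```python
def descend(alpha, w=5.0, steps=50):
    """Minimize the toy loss w**2 (gradient 2*w) with a fixed learning rate."""
    for _ in range(steps):
        w -= alpha * 2.0 * w   # step against the gradient
    return w

for alpha in (0.001, 0.1, 1.1):
    print(alpha, descend(alpha))
```

With the smallest value, the weight has barely moved from its starting point after 50 steps. The middle value lands very close to the minimum at zero, and the largest value overshoots further on every step, so the weight grows instead of shrinking.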
Optimization and Gradient Descent
The whole process of backpropagation is further described as using gradient descent, so named because backpropagation finds the gradient that describes the impact of an individual weight. It then reverses the direction of the gradient and uses that to find the global minimum of the solution. We refer to this entire process of optimizing a solution to a global minimum as optimization, because we reduce the total errors of a solution to a global minimum. Optimization itself is fundamental to data science and is used to describe the method that minimizes the errors of a method. Minimizing error is relative to the function being performed—either regression or classification, for instance—and uses the specific function error metric to determine performance. For example, with regression we may minimize on MSE.
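As a rough sketch of optimization in the regression case, the loop below uses gradient descent to minimize MSE for a single weight and bias. The data is made up so that the true relationship is y = 2x + 1, and the function names, learning rate, and epoch count are assumptions for this illustration only.

```python
# Made-up data following y = 2x + 1 (not from the book).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

def mse(w, b):
    """Mean squared error of the line w*x + b over the sample data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(alpha=0.05, epochs=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Partial derivatives of MSE with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Move each parameter a small step against its gradient.
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b

w, b = gradient_descent()
print(round(w, 4), round(b, 4), mse(w, b))
```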
Optimizers come in several variations, but many of the ones we use for deep learning are based on gradient descent, the backpropagation method. Here is a list of optimizers we will cover in this book:
 Gradient descent

This is the base algorithm, and it works as described in the section on backpropagation.
 Stochastic gradient descent (SGD)

This is an improved version of gradient descent that uses random batch sampling to improve generalization. It is the de facto standard, and we will devote a whole section to this method later.
 Nesterov

This method introduces the concept of momentum. Momentum is like an additional speed control for SGD and allows it to converge quicker. Nesterov provides an additional speed boost to momentum as well.
 AdaGrad

This is a form of gradient descent that adjusts to how frequent the data is. This, in turn, gives it an advantage when handling sparse data. Data associated with infrequent features with higher value will benefit more when using AdaGrad. However, this method does suffer from diminishing learning rates. This method also introduces the concept of adaptive learning rates.
 AdaDelta

This method is based on AdaGrad but improves on it by not requiring an initial learning rate (alpha) as the algorithm will adjust on its own. It also manages the diminishing learning rates better.
 RMSprop

This is a version of AdaDelta that was independently developed by Geoff Hinton.
 Adaptive Moment Estimation (Adam)

This is an extension to AdaDelta and RMSprop that allows finer control over the momentum parameters. Adam is also one of the more popular optimizers you may encounter in recent papers.
 AdaMax

This is a variant of Adam that scales its updates using the infinity norm of past gradients.
 Nadam

This is a combination of Nesterov momentum and Adam, which is like supercharging the momentum on Adam.
 AMSGrad

This is a newer variant of Adam that intends to fix cases where Adam fails to converge and where SGD with momentum has been shown to work just as well or better. This method is becoming the go-to when Adam does not perform as well as may be expected.
This list has doubled since 2012, and it likely could double again in a few short years.
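To give a flavor of how two of these ideas differ from the base algorithm, here is a minimal sketch of a momentum update and an AdaGrad-style update on a toy loss. The loss (w²), the function names, and every hyperparameter value are assumptions for illustration; real library implementations have more options and differ in detail.

```python
def grad(w):
    return 2.0 * w  # derivative of the toy loss w**2

def sgd_momentum(w=5.0, alpha=0.1, beta=0.9, steps=200):
    """Gradient descent with momentum: a velocity term carries past steps forward."""
    velocity = 0.0
    for _ in range(steps):
        velocity = beta * velocity - alpha * grad(w)
        w += velocity
    return w

def adagrad(w=5.0, alpha=0.5, eps=1e-8, steps=200):
    """AdaGrad-style update: divide by the root of accumulated squared gradients."""
    cache = 0.0
    for _ in range(steps):
        g = grad(w)
        cache += g * g
        # The effective step shrinks as cache grows (the diminishing learning rate).
        w -= alpha * g / (cache ** 0.5 + eps)
    return w

print(sgd_momentum())
print(adagrad())
```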
Note
You may find yourself generally sticking to a few standard optimizers for various classes of problems. A lot of this depends on the problem you are trying to solve and the data you are using. We will of course explore further details about optimizers in later chapters as we solve particular problems.
Vanishing or Exploding Gradients
Generally, the whole system of backpropagation (gradient descent) and finding the partial derivative with respect to each weight works automagically. That is, most deep learning libraries like Keras, TensorFlow, and PyTorch provide automatic differentiation of the partial derivative of a network out of the box. While this is incredibly powerful, and a blessing for those of us who used to do it by hand, it still has some problems. While we generally won’t encounter these issues until we look at larger and more complex networks, it is worth mentioning here.
Occasionally, and for a variety of reasons, the gradient descent optimization algorithm may start to calculate exploding or vanishing gradients. This may happen for the various optimizers we covered earlier. Remember, a gradient denotes the amount and direction a weight contributes to the network. An optimizer may start to calculate an incredibly large value for a gradient, called an exploding gradient, or conversely, very small or vanishing gradients. In the case of exploding gradients, the network will start to generally overpredict, while in the case of vanishing gradients, the network will just stop learning and freeze. To help diagnose these issues early, use the following guide:

The network does not improve after x number of iterations.

The network is unstable, and you see large changes in error moving from positive to negative.

The network appears to go backward in learning.
The best way to diagnose these issues is by watching and monitoring how your network trains. In most cases, you will want to closely observe your network training for the first several thousand iterations. In the next section, we’ll discuss further optimization when training networks.
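One way to build intuition for why gradients vanish or explode is to follow a gradient backward through a chain of identical layers. The sketch below is a deliberately simplified toy: it assumes one sigmoid node per layer, the same weight in every layer, and a pre-activation value of zero, none of which holds in a real network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # never larger than 0.25

def gradient_after(layers, weight, pre_activation=0.0):
    """Multiply a starting gradient of 1.0 by each layer's local factor."""
    grad = 1.0
    for _ in range(layers):
        grad *= weight * sigmoid_derivative(pre_activation)
    return grad

print(gradient_after(layers=10, weight=1.0))  # shrinks toward zero
print(gradient_after(layers=10, weight=8.0))  # grows with every layer
```

With a small weight, each layer multiplies the gradient by at most 0.25, so little signal reaches the early layers. With a large weight, each layer multiplies it by more than 1, and the value grows rapidly.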
SGD and Batching Samples
One problem we may come across when training on thousands of data samples is that the process can take a long time and isn’t general enough. That is, if we update our network for each individual sample, we may find updates that cancel each other out. This can be further compounded if the data is always pulled in the same order. To alleviate these problems, we introduce a random batching approach to updating our network. Batching the data into groups and then applying changes averaged across those groups better generalizes the network, which is usually a good thing. Furthermore, we randomize this batching process so that no two batches are alike and data is further processed randomly. This whole technique is called stochastic gradient descent when used with backpropagation to train a deep learning network.
We use the term stochastic to mean random since we are now pulling random groups of samples. The gradient descent part is the heart of backpropagation optimization, as we already learned. SGD is the standard optimizer, as we saw earlier. There are plenty of variations to SGD that are more powerful, and we will explore those as well. The important thing to remember about SGD and other optimizers is that they use batches of data and not individual samples. As it turns out, since we are using linear systems of equations, this also becomes more computationally efficient.
The batch_size parameter determines how many samples are used for each update to the network. Typical batch sizes are 32–256 for large dataset sizes. The batch size is a deceptive parameter that may or may not have a significant impact on network training. It generally will be one of the first parameters you tune to enhance a network. Smaller batch sizes produce larger swings from one update to the next, while larger batches smooth those changes out.
Another improvement to batching is mini-batching, which is when we break up the batches into smaller batches. These smaller and also random batches have been shown to increase data variance further, which is a good thing. This in turn leads to better generalization and, of course, better training.
There is also a third option—or should we say, the original option. Originally, data was just batched, and the method was called batch gradient descent. The major problem with this was that the batches were always the same. This reduced the data variance, which, as we now know, led to decreased training performance and learning. Batch gradient descent is an option, but not one you will choose very often.
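The random batching described in this section can be sketched with nothing more than the standard library. The function name iterate_minibatches and the sample data are made up for this illustration; deep learning libraries handle this step for you.

```python
import random

def iterate_minibatches(samples, batch_size, rng):
    """Yield the samples in random order, batch_size at a time."""
    indices = list(range(len(samples)))
    rng.shuffle(indices)  # a new random order on every call
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]

samples = list(range(10))   # stand-in for ten training samples
rng = random.Random(0)      # seeded so the run is repeatable

batches = list(iterate_minibatches(samples, batch_size=4, rng=rng))
for batch in batches:
    print(batch)
```

Calling the function again for the next pass over the data reshuffles the indices, so batches generally differ from one pass to the next.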
Batch Normalization and Regularization
Batch normalization is an additional process we may perform as the inputs flow through the network. Normalizing the data in a batch or after it processes through a layer allows for more stable networks by helping to avoid vanishing and exploding gradients. Regularization is a related process, but it typically involves balancing internal network weights using the L1 or L2 norm. We use the term norm to refer to a normalization of the vector space, or as performed in linear algebra, normalizing a vector. The L1 or L2 refers to the distance used to calculate the vector’s magnitude. In calculating the L1 norm, we use what is referred to as the taxicab or block distance, while the L2 norm is the more typical Euclidean distance. An example of calculating the L1 and L2 norm is shown in Equation 1-5. Notice the subtle but important difference between the two calculations.
Equation 1-5.
$$\lVert X \rVert_1 = |3| + |4| = 7$$
$$\lVert X \rVert_2 = \sqrt{3^2 + 4^2} = 5$$
Normalization and regularization can be important ways to optimize deep learning, as we will see when we start building networks.
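The L1 and L2 norms of the vector (3, 4) discussed above can be checked with a few lines of Python. The helper names l1_norm and l2_norm are just illustrative.

```python
import math

def l1_norm(vector):
    """Taxicab distance: the sum of absolute values."""
    return sum(abs(x) for x in vector)

def l2_norm(vector):
    """Euclidean distance: the square root of the sum of squares."""
    return math.sqrt(sum(x * x for x in vector))

X = [3, 4]
print(l1_norm(X))  # 7
print(l2_norm(X))  # 5.0
```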
Activation Functions
We already covered the absence of an activation function or just straight linear output of a network. We also looked at the step activation function, which essentially steps the output and is very useful in classification problems. We will use a variety of activation functions that are specific to regression or classification. However, in many cases we may use broader functions to work between hidden layers of networks that will work on either problem. Much like optimizers, there are a variety of activation functions that are more useful for certain problems and data types. There is often no hard-and-fast rule for which to use, and a lot of your experience working with these functions will come through hard work. Another option is to digest several papers and take recommendations from those. While that can work, and is quite useful anyway, you often have to be careful that their problems and network design align well with your own problem. The following are the more common activation functions you may come across and will likely use in this book:
 Linear

Essentially the absence of an activation function. The output from summation is sent directly to output. This, as we’ve seen, is a perfect function for regression problems.
 Step

We’ve seen a basic implementation of a step function in Example 1-3. Use this one for classification problems.
 Sigmoid or logistic activation

This is also called the squishification function because it squishes the output to a value between 0.0 and 1.0. It was the first common activation function because it was so easy to differentiate when calculating backpropagation by hand. Figure 1-13 shows the sigmoid function and how it resembles logistic regression. This method is used for classification.
 Tanh or hyperbolic tangent function

This squishes the output to between –1 and +1 and is most effective for classification problems.
 ReLU (rectified linear unit)

This is a combination of the step and linear functions. This function is used for regression and is quite often used between hidden layers.
 Leaky ReLU

Exactly like ReLU, but with a leaky step function. This function is quite effective in controlling vanishing or exploding gradients. Leaky ReLU works best between layers.
 Parametric ReLU (PReLU)

This is a leaky ReLU function that provides further control with parameters. Again, it is best used between layers, but it also works for regression.
 ELU (exponential linear unit)

This provides an exponential rather than a linear response. This method works for regression problems and between layers.
 Softmax

This is for classification problems where the output is represented by a probability vector that denotes how well an output ranging from 0.0 to 1.0 fits within a set of classes. The total sum of the output of all classes equals 1.0, or 100%. If we go back to the Perceptron Game and review the classification problems, we needed two output neurons to denote our separate classes. Softmax would allow us to reduce our output to one neuron that can output a vector of the classes and probabilities of being within each class.
The sample set of activation functions in Figure 1-13 has also almost doubled in just under a decade. If you are training deep learning networks, you need to keep current with the best activation and optimization functions and so on. Fortunately, when using AI (deep learning) services provided by Google, the cloud services manage most of that or, as we will see, provide help along the way.
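Several of the activation functions in the list above are short enough to write out by hand. The sketch below uses the textbook definitions in plain Python; the default slope of 0.01 for leaky ReLU and the default alpha of 1.0 for ELU are common choices assumed here, not values given in this chapter.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))        # squashes to (0, 1)

def tanh(z):
    return math.tanh(z)                      # squashes to (-1, 1)

def relu(z):
    return max(0.0, z)                       # zero below 0, linear above

def leaky_relu(z, slope=0.01):
    return z if z >= 0.0 else slope * z      # small slope instead of a hard zero

def elu(z, alpha=1.0):
    return z if z >= 0.0 else alpha * (math.exp(z) - 1.0)

def softmax(values):
    """Turn a list of raw outputs into probabilities that sum to 1.0."""
    shifted = [v - max(values) for v in values]   # avoids overflow in exp
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0), relu(-2.0), leaky_relu(-2.0))
print(softmax([1.0, 2.0, 3.0]))
```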
Loss Functions
As we saw when we talked about goodness of fit earlier in this chapter, our methods (or what we call models) need a way to determine the amount of error. We may also use the terms loss or cost to denote the total amount of error. Recall that our goal is to minimize this error, loss, or cost by using some variation of gradient descent optimization. As we’ve seen with MSE, loss or error functions also are differentiated by the goal of the network, be it regression or classification. Below is a quick preview of loss functions you will likely encounter in this book, or as you explore deep learning solutions on your own:
 MSE (mean squared error)

We covered this method earlier when we looked at regression. MSE represents the mean error distance squared.
 RMSE (root mean squared error)

This is the square root of MSE. This variation is useful when trying to better understand the variance of your model.
 MSLE (mean squared logarithmic error)

This denotes the MSE on a logarithmic scale. This method is useful for large ranges of numbers—that is, when values range from zero to billions or more.
 MAE (mean absolute error)

This measures the mean of the absolute differences between the predicted and actual values. Unlike MSE, it does not square each difference, so large errors and outliers are penalized less heavily.
 Binary classification functions

There is a whole list of base error functions that measure classification on outputs. They determine the amount of error an input is within a class (1) or not within a class (0), which works well for binary problems.
 Binary cross-entropy loss

With classification problems, it works better mathematically to classify in terms of probability within a class rather than to just use binary classification as above. That means this becomes the preferred method, and one we will discuss at length when we get to those later chapters.
 Hinge loss

This is a binary classification loss function similar to cross-entropy. It differs in that it allows classification to range in values from [–1,1]. Standard cross-entropy uses values in the range [0,1].
 Squared hinge loss
 Multiclass classifier

This is useful for when you want to class an input into multiple classes. For example, a picture of a dog fed into a network could be identified as a dog, and perhaps a specific dog breed and color.
 Multiclass cross-entropy loss

This is the same approach we use in binary cross-entropy, except it’s used for multiple class problems. It is the preferred and standard approach.
 Sparse multiclass cross-entropy loss

This deals with the problem of identifying data over a large number of classes. Datasets as large as 11,000 classes have been released in just the last year. Even with 18 million images fed into a classifier, with that many classes, that still only leaves about 1,600 images per class.
 Kullback–Leibler divergence loss (KL divergence)

This is an advanced function that determines the amount of error between distributions of data. It is not well suited to multiclass classification problems but does well for adversarial training.
Use this list of loss functions as a reference. We will explore the more important loss functions more closely in later examples. In the next section, we look at building a simple multilayer perceptron network.
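For reference, a handful of these loss functions can be written out in plain Python. The sketch below follows the standard textbook formulas; the function names and the small clipping constant used in the cross-entropy are assumptions for this illustration, and library versions will differ in detail.

```python
import math

def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: the square root of MSE."""
    return math.sqrt(mse(actual, predicted))

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def binary_crossentropy(labels, probabilities, eps=1e-12):
    """Labels are 0 or 1; probabilities are the predicted chance of class 1."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)   # clip so log() never sees 0
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

def hinge(labels, scores):
    """Labels are -1 or +1; scores are raw (unsquashed) outputs."""
    return sum(max(0.0, 1.0 - y * s) for y, s in zip(labels, scores)) / len(labels)

print(mse([1, 2, 3], [1, 2, 5]), mae([1, 2, 3], [1, 2, 5]))
print(binary_crossentropy([1], [0.5]))
```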
Building a Deep Learner
We already understand how to build a network, but once you add backpropagation, automatic differentiation, and all the other parts, it makes more sense to use a library like Keras, TensorFlow, or PyTorch. All of these libraries can run on a local machine, which is sometimes required because of data privacy concerns. For this book, we will use the cloud to build all of our networks. However, it can be useful to look at code examples of how deep learning networks are built with other libraries in Python. Example 1-5 shows a simple classifier network, one that could be used to solve our Perceptron Game 5 problem.
Example 1-5.
from keras.models import Sequential
from keras.layers import Dense

# X (inputs) and y (labels) are assumed to be loaded elsewhere
model = Sequential()
model.add(Dense(3, input_dim=2, activation='relu'))
model.add(Dense(2, activation='sigmoid'))
# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit the keras model on the dataset
model.fit(X, y, epochs=1000, batch_size=10)
The Keras code shown in Example 1-5 builds a deep learning network whose first layer has three nodes and whose second, output layer has two nodes. We refer to perceptrons as nodes here to make things more generic, since not all nodes in a future network may follow the perceptron model. The code starts by creating a variable called model of type Sequential. The model represents the entire deep learning network, and Sequential here just means the layers are connected one after another. After that, each layer is added with an add statement, the first layer being three nodes with an input dimension of 2 and a ReLU activation function. Don’t worry too much about the new activation functions just now. We will cover them later. Next, the output layer is added with a sigmoid function. Then the entire model is compiled, which means it is set up for backpropagation. Finally, the model calls fit, which means it will iterate through the data for 1,000 epochs or iterations, batching the learning process in groups of 10.
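To make the structure of the Keras example above more concrete, the following NumPy sketch traces a single forward pass through a network of the same shape: two inputs, a three-node ReLU layer, and a two-node sigmoid output layer. The random weights and the sample batch are placeholders invented for illustration. Keras picks its own initial weights and then adjusts them during fit, which this sketch does not attempt.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder weights: 2 inputs -> 3 hidden nodes -> 2 output nodes.
W1 = rng.normal(size=(2, 3))
b1 = np.zeros(3)
W2 = rng.normal(size=(3, 2))
b2 = np.zeros(2)


def relu(z):
    return np.maximum(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(X):
    """One forward pass: a 3-node ReLU layer followed by a 2-node sigmoid layer."""
    hidden = relu(X @ W1 + b1)
    return sigmoid(hidden @ W2 + b2)


# A made-up batch of 10 samples with 2 features each (one batch of size 10).
X_batch = rng.uniform(-1.0, 1.0, size=(10, 2))
outputs = forward(X_batch)
print(outputs.shape)  # (10, 2)
```

Each row of outputs holds two values between 0 and 1, one per output node. Training consists of nudging W1, b1, W2, and b2 so that those values move toward the labels.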
Example 1-5 shows how accessible and powerful this technology has become. What can be done in a handful of lines of Python code using the Keras library likely took hundreds of lines of code just a few years ago. However, as accessible as this technology is, it still requires an immense amount of data and processing power to be effective. While data can be accessed for free or may be available from your organization, computational processing is often another matter entirely, and this is why we focus on using cloud resources for all of the networks in this book.
Tip
Keras is a great library that can quickly get you programming deep learning models. Be sure to check out the Keras website for more information and tutorials to help you get started.
Fortunately, there is a free tool available from Google that will allow us to set up a multilayer network quickly and train it in minutes (yes, minutes). Open the TensorFlow Playground site, as shown in Figure 1-14.
As soon as you open that site, you will see there are two inputs, denoted X1 and X2, shown as two shaded boxes. These boxes represent distributions of data. You can think of a distribution as an endless box of data. Each time you reach into the box and pull a sample at random, the value of the sample is determined by the distribution. This is further explained in Figure 1-15. In the figure, we can see two distributions. If we guess a value of 0.5 (x) and apply it to each distribution, we get a value of 0.5 for uniform and perhaps 1.7 for normal. This is because the data is skewed by the shape of the distribution. This is an important concept to grasp, and it is one we will revisit later.
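The "endless box of data" idea can be tried directly by drawing samples with NumPy. The sketch below is only an illustration: the parameters (a uniform distribution on [0, 1) and a normal distribution with mean 0 and standard deviation 1) are assumptions and may not match the distributions drawn in the figure.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed parameters for illustration; the figure's distributions may differ.
uniform_samples = rng.uniform(low=0.0, high=1.0, size=10_000)
normal_samples = rng.normal(loc=0.0, scale=1.0, size=10_000)

for name, samples in [("uniform", uniform_samples), ("normal", normal_samples)]:
    print(f"{name:8s} mean={samples.mean():.3f} std={samples.std():.3f} "
          f"min={samples.min():.3f} max={samples.max():.3f}")

# Fraction of samples that land in a narrow band around each distribution's center.
uniform_center = np.mean(np.abs(uniform_samples - 0.5) < 0.05)
normal_center = np.mean(np.abs(normal_samples) < 0.05)
print("uniform near center:", uniform_center)
print("normal near center: ", normal_center)
```

Every pull from the uniform box is equally likely to land anywhere in its range, while pulls from the normal box cluster around the mean and thin out toward the tails. That difference in shape is what the two shaded boxes are conveying.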
Note
Being able to understand distributions and probability is fundamentally important to data science, and in turn, deep learning. If you find you lack some knowledge in statistics or probability theory, you should brush up. Again, there are plenty of free materials online.
Getting back to Figure 1-14 and TensorFlow Playground, we can see that the network has two hidden layers, the first with four neurons and the second with two neurons. Pressing the Play button on the left will start the network training, and you will see how the network classifies the output as the epochs progress. In the end, the loss is minimized, and the fit looks quite nice. However, at this point and always, we want to understand whether we can optimize the network in some manner.
Optimizing a Deep Learning Network
After we have our inputs flowing through the network and can see that the outputs are training effectively, our next step is always to optimize the network. We want to do this step before any data validation as well. Recall that we always want to break our input data into three sets of data for training, testing, and validation. Before doing that, though, there are a few simple tricks we can apply to this model or to any network:
 Learning rate = alpha (α)

Determine what effect adjusting the learning rate up or down has on the network. Adjust the learning rate to 0.01 and replay the sample. Then adjust the rate to 0.1. Which learned faster?
 Activation function = tanh

Try various activation functions. Tanh and sigmoid work well with classification, while ReLU and linear are applicable to regression.
 Regularization and regularization rate

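Regularization adds a penalty on the size of the weights (the Playground offers L1 and L2 options), which acts as a way of keeping values in check between layers. We do this to keep any single weight from growing too large, which helps avoid the exploding gradients and overfitting that oversized weights can cause.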
 Hidden layers

Increase the number of hidden layers, and thus neurons, in the network. Determine the effect this has on the network.
 Neurons

Increase or decrease the number of neurons on each layer of the network. Monitor the training performance of the network and watch for over- or underfitting.
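The learning rate experiment from the list above can also be reproduced outside the Playground on a much simpler problem. The sketch below runs plain gradient descent on a toy one-weight loss, (w - target)**2, which stands in for a real network. The function name, the starting point, and the target value are all made up for the example.

```python
def gradient_descent(learning_rate, steps=50, start=0.0, target=3.0):
    """Minimize the toy loss (w - target)**2 and return the loss after each step."""
    w = start
    losses = []
    for _ in range(steps):
        gradient = 2.0 * (w - target)      # derivative of (w - target)**2
        w -= learning_rate * gradient      # step against the gradient
        losses.append((w - target) ** 2)
    return losses


for alpha in (0.01, 0.1, 1.5):
    losses = gradient_descent(alpha)
    print(f"alpha={alpha}: first loss={losses[0]:.4f}, last loss={losses[-1]:.4g}")
```

On this toy loss the rate of 0.1 closes in on the target far faster than 0.01, while 1.5 overshoots on every step and the loss grows instead of shrinking. Real networks are less tidy, but the same trade-off between slow progress and instability applies.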
Figure 1-16 shows a network being trained with a modified learning rate. Play with the network and try to optimize it to the fewest neurons while still being able to learn to classify the outputs effectively. Make sure that you do not overfit or underfit to the data.
You are unable to control the loss function aside from setting the problem type as regression or classification. Be sure to switch between the two problem types and see what effect that has on the output as well. In the next section, we look at what can happen if your network design has too few or too many layers or neurons.
Overfitting and Underfitting
One of our primary goals in optimizing a network is to build the smallest and most concise network for the task at hand. It can be very easy to throw endless layers, neurons, and weights at a problem. The problem with this approach is that deep learning networks can actually memorize data; that is, they can learn data so well that they just remember the answer to a specific question rather than generalize an answer. This is the reason we withhold a percentage of data for both test and validation. Typically, after optimizing a network on a set of training data, you then evaluate the trained network on the test dataset. If the network produces comparable results, we say it has generalized beyond the training data. In some cases, running the test set may generate very bad predictions, and this often indicates the network has been overtrained or overfitted to the training data.
Note
Breaking data into training, test, and validation sets provides two phases of confirmation of your model. You can use the test dataset as a first-pass test against your trained model, and the validation set can be used as a second-pass test against the model. In more critical applications, you may have more phases of test/validation data.
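One way to produce the three sets is to shuffle the rows once and slice them. The following is a minimal sketch rather than a recommended recipe: the function name split_dataset, the 70/15/15 proportions, and the toy arrays are assumptions chosen for the example.

```python
import numpy as np


def split_dataset(X, y, train=0.7, test=0.15, seed=0):
    """Shuffle the rows, then cut them into training, test, and validation sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_train = int(round(len(X) * train))
    n_test = int(round(len(X) * test))
    train_idx = order[:n_train]
    test_idx = order[n_train:n_train + n_test]
    val_idx = order[n_train + n_test:]      # whatever rows remain
    return ((X[train_idx], y[train_idx]),
            (X[test_idx], y[test_idx]),
            (X[val_idx], y[val_idx]))


# Toy data: 100 samples with 2 features each and a binary label.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2
(train_X, train_y), (test_X, test_y), (val_X, val_y) = split_dataset(X, y)
print(len(train_X), len(test_X), len(val_X))  # 70 15 15
```

Shuffling before slicing matters when the data is stored in some meaningful order, such as sorted by class, because otherwise one of the sets could end up missing whole categories.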
Over- and underfitting are critical to building successful networks, so this is a topic we will revisit over and over again throughout this book. It is easy to see how we can over- and underfit using TensorFlow Playground. Figure 1-17 shows the results of over- and underfitting the neural network. Add or remove layers and neurons to see if you can create the same over- and underfit patterns. You may also have to alter the learning rate, the activation function, or the number of epochs you run. On the bottom-left side of the interface, there are also options to set the ratio of training to test data, as well as noise and batch size.
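The same pattern can be seen without a neural network at all. The sketch below deliberately swaps in polynomial curve fitting as a stand-in, since the degree of the polynomial plays the role of layers and neurons. The target function sin(3x), the noise level, and the sample sizes are assumptions picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)


def make_data(n):
    """Noisy samples of an assumed target function, sin(3x)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    return x, np.sin(3.0 * x) + rng.normal(scale=0.2, size=n)


x_train, y_train = make_data(15)    # small training set
x_test, y_test = make_data(200)     # held-out data the fit never sees


def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the given points."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train MSE={results[degree][0]:.4f}, "
          f"test MSE={results[degree][1]:.4f}")
```

The training error can only go down as the degree rises, because a higher-degree polynomial can always reproduce a lower-degree one. The test error is the number to watch: a straight line is usually too simple for this curve (underfitting), while a high-degree fit on so few points tends to chase the noise (overfitting).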
Network Capacity
The capacity of a network, that is, the number of neurons and weights it contains, describes its ability to learn the required data. If your network is small, with only a few layers and neurons, you would not expect such a network to learn a large dataset. Likewise, a network that is too large, with lots of layers and neurons, may simply memorize the data. Again, this is the reason we break out test and validation datasets to confirm how well a network performs after training.
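A quick way to get a feel for capacity is to count the trainable values in a fully connected network. The helper below is a small sketch with a made-up name, count_parameters. It assumes every layer is fully connected to the next and that each node has one bias, and the layer sizes shown are arbitrary examples.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a fully connected network, e.g. [2, 3, 2]."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix plus one bias per node
    return total


print(count_parameters([2, 3, 2]))        # 17
print(count_parameters([2, 4, 2, 1]))     # 25
print(count_parameters([2, 64, 64, 1]))   # 4417
```

Going from a handful of neurons per layer to 64 multiplies the number of values the network can tune by more than a hundred, which is why large networks find it so easy to memorize small datasets.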
Hopefully you can now appreciate how simultaneously easy and difficult building deep learning networks can be. On the surface, stacking layers and designing networks is like combining Lego blocks that wrap around complex systems of equations. Building deep learning models takes attention to detail and patience—lots of patience. Fortunately, using the Google Cloud Platform will wrap many of the complex details and provide a performant platform that should reduce training times, thus allowing you to conduct more experiments in the same amount of time.
Conclusion
Deep learning has become the cornerstone of the new wave of AI tech that is sweeping the globe. The thing we need to remind ourselves, though, is that the foundation of this AI is still based on old tech like data science. This means we still need to understand the tenets of data science in order to be successful AI practitioners. That in turn means that understanding the data is also a requirement for anyone looking to be successful. Sadly, this fact is often lost on eager newcomers looking to build cool AI, only to find nothing they try works. Almost always this speaks to a lack of understanding of the fundamentals and the data. Hopefully you can appreciate the importance of data science and keep that in mind as we move into deeper AI. In the next chapter, we will begin exploring AI on the Google Cloud Platform.
Game Answers
Here are the answers for the Perceptron Games 2 and 3. Some of the games may allow for multiple solutions, and this may mean your solution is not listed.
W1  W2  W0 or bias
1   2   0

W1  W2  W3  W0 or bias
1   2   0   0
Game 5
There are multiple solutions for Game 5, and we leave it up to the reader to find them on their own. The solution to Game 5 is not important, however; what is important is that you understand how a fully connected network functions.