Now, we will look at a small variant of the Adam algorithm called Adamax. Let's recall the equation of the second-order moment in Adam:

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
As you may have noticed from the preceding equation, we scale the gradients inversely proportional to the $\ell_2$ norm of the current and past gradients (the $\ell_2$ norm here basically means squaring the values):

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) |g_t|^2$$
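To make this scaling concrete, here is a minimal NumPy sketch of the second-moment accumulation; the function and variable names (`update_second_moment`, `v`, `grad`, `beta2`) are illustrative choices of ours, not taken from any particular library:

```python
import numpy as np

# Illustrative Adam second-moment accumulation: v keeps an exponentially
# decaying average of squared gradients, i.e. a running l2-style magnitude
# of the current and past gradients.
def update_second_moment(v, grad, beta2=0.999):
    return beta2 * v + (1 - beta2) * np.square(grad)

# Toy usage: Adam later divides the step size by sqrt(v) (plus a small
# epsilon), so parameters with larger accumulated gradient magnitude
# receive smaller updates.
v = np.zeros(3)
for grad in [np.array([0.1, -2.0, 0.5]), np.array([0.2, -1.5, 0.4])]:
    v = update_second_moment(v, grad)
print(v)
```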
Instead of having just the $\ell_2$ norm, can we generalize it to the $\ell_p$ norm?

$$v_t = \beta_2^p v_{t-1} + (1 - \beta_2^p) |g_t|^p$$

In general, norms with a large value of $p$ become numerically unstable, which is why the $\ell_1$ and $\ell_2$ norms are the ones most widely used in practice. However, letting $p \to \infty$ yields a surprisingly simple and stable update rule, and that is exactly the idea behind Adamax.
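For reference, taking $p \to \infty$ turns the decaying average into a running maximum, $u_t = \max(\beta_2 u_{t-1}, |g_t|)$, which serves as the Adamax second moment in place of $\sqrt{v_t}$. The sketch below is a toy, self-contained version of one such update step under that assumption; the names (`adamax_step`, `theta`, `m`, `u`) and the small epsilon in the denominator are our illustrative choices, not a definitive implementation:

```python
import numpy as np

# A minimal, illustrative Adamax step. Taking p -> infinity turns the
# decaying lp average into a running maximum:
#     u_t = max(beta2 * u_{t-1}, |g_t|)
# which replaces sqrt(v_t) in Adam's denominator and needs no bias correction.
def adamax_step(theta, m, u, grad, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad       # first moment, as in Adam
    u = np.maximum(beta2 * u, np.abs(grad))  # l-infinity second moment
    m_hat = m / (1 - beta1 ** t)             # bias-correct the first moment
    theta = theta - lr * m_hat / (u + eps)   # parameter update
    return theta, m, u

# Toy usage on a single parameter vector with two hand-picked gradients.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
u = np.zeros_like(theta)
for t, grad in enumerate([np.array([0.3, -0.1]), np.array([0.2, -0.4])], start=1):
    theta, m, u = adamax_step(theta, m, u, grad, t)
print(theta)
```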