Long Short-Term Memory meta learners

Let's think back to when we learned about gradient-based optimization for a moment. What happened there? We started at an initial point in the parameter space, calculated the gradient, took a step toward the local/global minimum, and then repeated these steps. In gradient descent with momentum, we used the history of previous updates to guide the next one. If you think about it carefully, keeping a running summary of past updates is similar to how RNNs and Long Short-Term Memory (LSTM) networks carry a hidden state across time steps, so we can replace the hand-designed update rule of gradient descent with an RNN. This approach is known as learning to learn by gradient descent. The reasoning behind this name is that we train the RNN using gradient descent, and the trained RNN then acts as the optimizer, producing the parameter updates itself.
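To make this concrete, here is a minimal sketch of the idea using PyTorch. The class name LSTMOptimizer and the toy objective are our own illustrative choices, not code from this book: a coordinatewise LSTM cell receives each parameter's gradient and proposes an update, playing the role that the hand-designed rule (for example, momentum) plays in ordinary gradient descent. The sketch only shows the forward use of the learned optimizer; training the LSTM itself would require unrolling these updates and backpropagating the optimizee's loss through them.

import torch
import torch.nn as nn

class LSTMOptimizer(nn.Module):
    """Coordinatewise LSTM: maps each parameter's gradient to an update."""
    def __init__(self, hidden_size=20):
        super().__init__()
        self.lstm = nn.LSTMCell(input_size=1, hidden_size=hidden_size)
        self.out = nn.Linear(hidden_size, 1)  # project hidden state to an update

    def forward(self, grad, state):
        # grad: (num_params, 1) column of gradients, treated as a batch of coordinates
        h, c = self.lstm(grad, state)
        update = self.out(h)  # proposed step for each coordinate
        return update, (h, c)

# Toy optimizee: minimize f(theta) = ||theta||^2 with the learned optimizer.
meta_opt = LSTMOptimizer()
theta = torch.randn(5, 1, requires_grad=True)
state = (torch.zeros(5, 20), torch.zeros(5, 20))  # per-coordinate (h, c)

for step in range(10):
    loss = (theta ** 2).sum()
    grad, = torch.autograd.grad(loss, theta)
    update, state = meta_opt(grad, state)
    # Apply the update the LSTM proposes instead of a hand-designed rule.
    theta = (theta + update).detach().requires_grad_(True)

Notice that the LSTM's hidden state plays the same role as the momentum term: it accumulates information about past gradients and uses it to shape the next step, except that here the accumulation rule is learned rather than fixed.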