These two problems can be overcome by the following (each technique is sketched in code after the list):
- Minimizing the use of the sigmoid and tanh activation functions, whose saturating regions produce near-zero gradients
- Using momentum-based stochastic gradient descent
- Proper initialization of weights and biases, such as Xavier initialization
- Regularization (adding a regularization loss to the data loss and minimizing the combined loss)
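
As a rough illustration of the first and third points, here is a minimal NumPy sketch, with the layer sizes and the ReLU choice being assumptions rather than anything specified above, that initializes weights with Xavier scaling and uses a non-saturating activation in the forward pass:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(fan_in, fan_out):
    # Xavier/Glorot initialization: scale chosen so activation variance
    # stays roughly constant from layer to layer.
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def relu(x):
    # A non-saturating activation: its gradient is 1 for positive inputs,
    # so it does not shrink the backpropagated signal the way the flat
    # tails of sigmoid/tanh do.
    return np.maximum(0.0, x)

# Hypothetical two-layer forward pass (layer sizes are made up)
W1, b1 = xavier_init(784, 256), np.zeros(256)
W2, b2 = xavier_init(256, 10), np.zeros(10)
x = rng.standard_normal((32, 784))
hidden = relu(x @ W1 + b1)
logits = hidden @ W2 + b2
```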
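
A minimal sketch of a momentum-based SGD update, assuming plain NumPy arrays for the parameters and a stand-in gradient used purely for illustration:

```python
import numpy as np

def momentum_sgd_step(param, grad, velocity, lr=0.01, momentum=0.9):
    # Classic momentum: keep an exponentially decaying average of past
    # gradients; this damps oscillations and keeps the update moving
    # even where the instantaneous gradient is very small.
    velocity = momentum * velocity - lr * grad
    return param + velocity, velocity

# Hypothetical usage with a single weight matrix and a stand-in gradient
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
v = np.zeros_like(W)
for _ in range(100):
    grad = 2 * W                      # gradient of ||W||^2, for illustration only
    W, v = momentum_sgd_step(W, grad, v)
print(np.abs(W).max())                # weights shrink toward zero
```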
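
Finally, a sketch of the last point: adding a regularization loss to the data loss and minimizing the combined objective. The L2 penalty and the `reg_strength` value here are illustrative assumptions, not values from the original text:

```python
import numpy as np

def total_loss(data_loss, weights, reg_strength=1e-4):
    # Add an L2 regularization loss to the data loss; the optimizer then
    # minimizes the combined objective, which discourages large weights.
    reg_loss = reg_strength * sum(np.sum(W * W) for W in weights)
    return data_loss + reg_loss

# Hypothetical usage: a made-up data loss value and two weight matrices
rng = np.random.default_rng(0)
weights = [rng.standard_normal((784, 256)), rng.standard_normal((256, 10))]
print(total_loss(data_loss=0.73, weights=weights))
```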