An overview of gradient descent optimization algorithms.
Gradient descent variants

Three variants of gradient descent are in common use: batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. They differ in how much data is used to compute the gradient of the objective for each update: batch gradient descent averages the per-example loss gradients over the entire training set, mini-batch gradient descent averages them over a small batch of examples, and stochastic gradient descent updates the parameters using the gradient of a single example at a time (see the sketch at the end of this section).

Challenges

Choosing a proper learning rate can be difficult. Additionally, the same learning rate applies to all parameter updates. If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but perform a larger update for rarely occurring features.

Another key challenge of minimizing the highly non-convex error functions common for neural networks is avoiding getting trapped in their numerous suboptimal local minima. Dauphin et al. [3] argue that the difficulty arises in fact not from local minima but from saddle points, i.e. points where one dimension slopes up and another slopes down. These saddle points are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.
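To make the three variants concrete, the following is a minimal, self-contained sketch on a toy least-squares problem; the data, the helper names (grad, batch_gd, sgd, minibatch_gd), and the learning rates and epoch counts are illustrative assumptions rather than recommendations.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))            # toy design matrix (assumed example data)
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)

def grad(w, X_batch, y_batch):
    # Gradient of the mean squared error 0.5/n * ||X w - y||^2 w.r.t. w.
    n = len(y_batch)
    return X_batch.T @ (X_batch @ w - y_batch) / n

def batch_gd(w, lr=0.1, epochs=100):
    # Batch gradient descent: one update per epoch, gradient over the full dataset.
    for _ in range(epochs):
        w = w - lr * grad(w, X, y)
    return w

def sgd(w, lr=0.01, epochs=10):
    # Stochastic gradient descent: one update per training example.
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            w = w - lr * grad(w, X[i:i + 1], y[i:i + 1])
    return w

def minibatch_gd(w, lr=0.05, epochs=20, batch_size=50):
    # Mini-batch gradient descent: one update per mini-batch of examples.
    for _ in range(epochs):
        idx = rng.permutation(len(y))
        for start in range(0, len(y), batch_size):
            b = idx[start:start + batch_size]
            w = w - lr * grad(w, X[b], y[b])
    return w

w0 = np.zeros(5)
for name, fn in [("batch", batch_gd), ("sgd", sgd), ("mini-batch", minibatch_gd)]:
    print(name, np.linalg.norm(fn(w0.copy()) - true_w))

The trade-off the sketch illustrates is the usual one: batch gradient descent makes stable but expensive updates, SGD makes cheap but noisy ones, and mini-batch gradient descent sits in between, which is why it is the default in practice.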