An overview of gradient descent optimization algorithms.

 Gradient descent variants

    In everyday practice there are three main variants of gradient descent: batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. The first and the third compute the gradient of the loss for each sample (over the full dataset or over a mini-batch, respectively) and then average those gradients to form the update, as sketched below.
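
The following is a minimal NumPy sketch of the three variants, assuming a least-squares objective purely for illustration; the function names (grad, batch_gd, sgd, minibatch_gd) and the hyperparameter values are illustrative choices, not prescribed by the text.

```python
import numpy as np

def grad(theta, X, y):
    # Gradient of the mean squared error 0.5 * ||X @ theta - y||^2 / n (illustrative loss).
    return X.T @ (X @ theta - y) / len(y)

def batch_gd(theta, X, y, eta=0.1, epochs=100):
    # Batch gradient descent: one update per pass over the full dataset.
    for _ in range(epochs):
        theta = theta - eta * grad(theta, X, y)
    return theta

def sgd(theta, X, y, eta=0.01, epochs=100):
    # Stochastic gradient descent: one update per training example.
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):
            theta = theta - eta * grad(theta, X[i:i + 1], y[i:i + 1])
    return theta

def minibatch_gd(theta, X, y, eta=0.05, epochs=100, batch_size=32):
    # Mini-batch gradient descent: one update per small batch of examples.
    for _ in range(epochs):
        idx = np.random.permutation(len(y))
        for start in range(0, len(y), batch_size):
            batch = idx[start:start + batch_size]
            theta = theta - eta * grad(theta, X[batch], y[batch])
    return theta
```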

Challenges

  • Choosing a proper learning rate can be difficult. A learning rate that is too small leads to painfully slow convergence, while one that is too large can hinder convergence and cause the loss to fluctuate around the minimum or even to diverge.

  • Additionally, the same learning rate applies to all parameter updates. If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but perform a larger update for rarely occurring features.

  • Another key challenge of minimizing highly non-convex error functions common for neural networks is avoiding getting trapped in their numerous suboptimal local minima. Dauphin et al. [3] argue that the difficulty arises in fact not from local minima but from saddle points, i.e. points where one dimension slopes up and another slopes down. These saddle points are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.

Gradient descent optimization algorithms

Momentum

          SGD has trouble navigating ravines, i.e. areas where the surface curves much more steeply in one dimension than in another [4], which are common around local optima. In these scenarios, SGD oscillates across the slopes of the ravine while only making hesitant progress along the bottom towards the local optimum as in Image 2.

Image 2: SGD without momentum
Image 3: SGD with momentum

        Momentum [5] is a method that helps accelerate SGD in the relevant direction and dampens oscillations, as can be seen in Image 3. It does this by adding a fraction γ of the update vector of the past time step to the current update vector:

v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta)
\theta = \theta - v_t

        Note: Some implementations exchange the signs in the equations. The momentum term γ is usually set to 0.9 or a similar value.
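
Below is a minimal sketch of a single momentum update in this notation, assuming a hypothetical grad_fn(theta) that returns ∇θJ(θ) and NumPy-array (or scalar) parameters.

```python
def momentum_update(theta, v, grad_fn, eta=0.01, gamma=0.9):
    # v_t = gamma * v_{t-1} + eta * grad J(theta); theta = theta - v_t
    v = gamma * v + eta * grad_fn(theta)
    theta = theta - v
    return theta, v
```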


Nesterov accelerated gradient (NAG, an improved version of momentum)

    However, a ball that rolls down a hill, blindly following the slope, is highly unsatisfactory. We'd like to have a smarter ball, a ball that has a notion of where it is going so that it knows to slow down before the hill slopes up again.

    Nesterov accelerated gradient (NAG) [6] is a way to give our momentum term this kind of prescience. We know that we will use our momentum term γv_{t-1} to move the parameters θ. Computing θ − γv_{t-1} thus gives us an approximation of the next position of the parameters (the gradient is missing for the full update), a rough idea where our parameters are going to be. We can now effectively look ahead by calculating the gradient not w.r.t. our current parameters θ but w.r.t. the approximate future position of our parameters:

v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta - \gamma v_{t-1})
\theta = \theta - v_t

    Again, we set the momentum term γ to a value of around 0.9.
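
A minimal NAG sketch, again assuming a hypothetical grad_fn(theta); the only change from the momentum sketch above is that the gradient is evaluated at the look-ahead point θ − γv:

```python
def nag_update(theta, v, grad_fn, eta=0.01, gamma=0.9):
    lookahead = theta - gamma * v            # approximate future position of the parameters
    v = gamma * v + eta * grad_fn(lookahead)
    theta = theta - v
    return theta, v
```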


Adagrad (adaptive learning rates: different parameters get different learning rates)

Adagrad [9] is an algorithm for gradient-based optimization that does just this: It adapts the learning rate to the parameters, performing smaller updates
(i.e. low learning rates) for parameters associated with frequently occurring features, and larger updates (i.e. high learning rates) for parameters associated with infrequent features. For this reason, it is well-suited for dealing with sparse data. Dean et al. [10] have found that Adagrad greatly improved the robustness of SGD and used it for training large-scale neural nets at Google, which -- among other things -- learned to recognize cats in Youtube videos. Moreover, Pennington et al. [11] used Adagrad to train GloVe word embeddings, as infrequent words require much larger updates than frequent ones.

Previously, we performed an update for all parameters θ at once as every parameter θ_i used the same learning rate η. As Adagrad uses a different learning rate for every parameter θ_i at every time step t, we first show Adagrad's per-parameter update, which we then vectorize. For brevity, we use g_t to denote the gradient at time step t; g_{t,i} is then the partial derivative of the objective function w.r.t. the parameter θ_i at time step t:

g_{t,i} = \nabla_\theta J(\theta_{t,i})

The SGD update for every parameter θi at each time step t then becomes:

\theta_{t+1,i} = \theta_{t,i} - \eta \cdot g_{t,i}

In its update rule, Adagrad modifies the general learning rate η at each time step t for every parameter θi based on the past gradients that have been computed for θi:

\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \cdot g_{t,i}

G_t ∈ R^{d×d} here is a diagonal matrix where each diagonal element i,i is the sum of the squares of the gradients w.r.t. θ_i up to time step t [12], while ϵ is a smoothing term that avoids division by zero (usually on the order of 1e-8). Interestingly, without the square root operation, the algorithm performs much worse.

As G_t contains the sum of the squares of the past gradients w.r.t. all parameters θ along its diagonal, we can now vectorize our implementation by performing an element-wise matrix-vector multiplication ⊙ between G_t and g_t:

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t

One of Adagrad's main benefits is that it eliminates the need to manually tune the learning rate. Most implementations use a default value of 0.01 and leave it at that.
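
A minimal per-step Adagrad sketch, storing only the diagonal of G_t as a vector G of accumulated squared gradients; grad_fn(theta) is again a hypothetical gradient function and the defaults follow the text:

```python
import numpy as np

def adagrad_update(theta, G, grad_fn, eta=0.01, eps=1e-8):
    g = grad_fn(theta)
    G = G + g ** 2                               # accumulate squared gradients per parameter
    theta = theta - eta / np.sqrt(G + eps) * g   # per-parameter (shrinking) learning rate
    return theta, G
```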

Adagrad's main weakness is its accumulation of the squared gradients in the denominator: Since every added term is positive, the accumulated sum keeps growing during training. This in turn causes the learning rate to shrink and eventually become infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge. The following algorithms aim to resolve this flaw.


Adadelta (an improved version of Adagrad)

Adadelta [13] is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to some fixed size w.

Instead of inefficiently storing w previous squared gradients, the sum of gradients is recursively defined as a decaying average of all past squared gradients. The running average E[g^2]_t at time step t then depends (as a fraction γ, similarly to the momentum term) only on the previous average and the current gradient:

E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2

We set γ to a similar value as the momentum term, around 0.9. For clarity, we now rewrite our vanilla SGD update in terms of the parameter update vector Δθt:

\Delta\theta_t = -\eta \cdot g_{t,i}
\theta_{t+1} = \theta_t + \Delta\theta_t

The parameter update vector of Adagrad that we derived previously thus takes the form:

\Delta\theta_t = -\frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t

We now simply replace the diagonal matrix G_t with the decaying average over past squared gradients E[g^2]_t:

\Delta\theta_t = -\frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t

As the denominator is just the root mean squared (RMS) error criterion of the gradient, we can replace it with the criterion short-hand:

\Delta\theta_t = -\frac{\eta}{RMS[g]_t} g_t

The authors note that the units in this update (as well as in SGD, Momentum, or Adagrad) do not match, i.e. the update should have the same hypothetical units as the parameter. To realize this, they first define another exponentially decaying average, this time not of squared gradients but of squared parameter updates:

E[\Delta\theta^2]_t = \gamma E[\Delta\theta^2]_{t-1} + (1 - \gamma) \Delta\theta_t^2

The root mean squared error of parameter updates is thus:

RMS[\Delta\theta]_t = \sqrt{E[\Delta\theta^2]_t + \epsilon}

Since RMS[Δθ]_t is unknown, we approximate it with the RMS of parameter updates until the previous time step. Replacing the learning rate η in the previous update rule with RMS[Δθ]_{t-1} finally yields the Adadelta update rule:

\Delta\theta_t = -\frac{RMS[\Delta\theta]_{t-1}}{RMS[g]_t} g_t
\theta_{t+1} = \theta_t + \Delta\theta_t

With Adadelta, we do not even need to set a default learning rate, as it has been eliminated from the update rule.
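
A minimal Adadelta sketch in this notation, keeping running averages of squared gradients (Eg2) and squared updates (Edx2); grad_fn(theta) is a hypothetical gradient function:

```python
import numpy as np

def adadelta_update(theta, Eg2, Edx2, grad_fn, gamma=0.9, eps=1e-8):
    g = grad_fn(theta)
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2                # E[g^2]_t
    # RMS[dx]_{t-1} / RMS[g]_t * g, using the previous Edx2 before it is updated
    delta = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g
    Edx2 = gamma * Edx2 + (1 - gamma) * delta ** 2          # E[dx^2]_t
    theta = theta + delta
    return theta, Eg2, Edx2
```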


RMSprop (a simplified version of Adadelta; it behaves like Adadelta during the first few updates)

RMSprop is an unpublished, adaptive learning rate method proposed by Geoff Hinton in Lecture 6e of his Coursera Class.

RMSprop and Adadelta were both developed independently around the same time, stemming from the need to resolve Adagrad's radically diminishing learning rates. RMSprop is in fact identical to the first update vector of Adadelta that we derived above:

E[g^2]_t = 0.9 E[g^2]_{t-1} + 0.1 g_t^2
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t

RMSprop as well divides the learning rate by an exponentially decaying average of squared gradients. Hinton suggests γ to be set to 0.9, while a good default value for the learning rate η is 0.001.
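
A minimal RMSprop sketch with Hinton's suggested defaults (γ = 0.9, η = 0.001); grad_fn(theta) is again a hypothetical gradient function:

```python
import numpy as np

def rmsprop_update(theta, Eg2, grad_fn, eta=0.001, gamma=0.9, eps=1e-8):
    g = grad_fn(theta)
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2     # decaying average of squared gradients
    theta = theta - eta / np.sqrt(Eg2 + eps) * g
    return theta, Eg2
```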


Adam (a combination of momentum and Adadelta)

Adaptive Moment Estimation (Adam) [14] is another method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients vt like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients mt, similar to momentum. Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface [15]. We compute the decaying averages of past and past squared gradients mt and vt respectively as follows:

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

mt and vt are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively, hence the name of the method. As mt and vt are initialized as vectors of 0's, the authors of Adam observe that they are biased towards zero, especially during the initial time steps, and especially when the decay rates are small (i.e. β1 and β2 are close to 1).

They counteract these biases by computing bias-corrected first and second moment estimates:

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}

They then use these to update the parameters just as we have seen in Adadelta and RMSprop, which yields the Adam update rule:

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t

The authors propose default values of 0.9 for β1, 0.999 for β2, and 10^{-8} for ϵ. They show empirically that Adam works well in practice and compares favorably to other adaptive learning-method algorithms.
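
A minimal per-step Adam sketch with the proposed defaults; t is the 1-based time step used for bias correction and grad_fn(theta) is a hypothetical gradient function:

```python
import numpy as np

def adam_update(theta, m, v, t, grad_fn, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    g = grad_fn(theta)
    m = beta1 * m + (1 - beta1) * g            # first moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2       # second moment estimate
    m_hat = m / (1 - beta1 ** t)               # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)               # bias-corrected second moment
    theta = theta - eta / (np.sqrt(v_hat) + eps) * m_hat
    return theta, m, v
```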


AdaMax

The v_t factor in the Adam update rule scales the gradient inversely proportionally to the ℓ_2 norm of the past gradients (via the v_{t-1} term) and current gradient |g_t|^2:

v_t = \beta_2 v_{t-1} + (1 - \beta_2) |g_t|^2

We can generalize this update to the ℓ_p norm. Note that Kingma and Ba also parameterize β_2 as β_2^p:

v_t = \beta_2^p v_{t-1} + (1 - \beta_2^p) |g_t|^p

Norms for large p values generally become numerically unstable, which is why ℓ_1 and ℓ_2 norms are most common in practice. However, the ℓ_∞ norm also generally exhibits stable behavior. For this reason, the authors propose AdaMax (Kingma and Ba, 2015) and show that v_t with ℓ_∞ converges to the following more stable value. To avoid confusion with Adam, we use u_t to denote the infinity norm-constrained v_t:

u_t = \beta_2^\infty v_{t-1} + (1 - \beta_2^\infty) |g_t|^\infty = \max(\beta_2 \cdot v_{t-1}, |g_t|)

We can now plug this into the Adam update equation by replacing \sqrt{\hat{v}_t} + \epsilon with u_t to obtain the AdaMax update rule:

\theta_{t+1} = \theta_t - \frac{\eta}{u_t} \hat{m}_t

Note that as u_t relies on the max operation, it is not as susceptible to bias towards zero as m_t and v_t in Adam, which is why we do not need to compute a bias correction for u_t. Good default values are again η = 0.002, β_1 = 0.9, and β_2 = 0.999.
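
A minimal AdaMax sketch with these defaults; grad_fn(theta) is a hypothetical gradient function, and u holds the running infinity-norm term:

```python
import numpy as np

def adamax_update(theta, m, u, t, grad_fn, eta=0.002, beta1=0.9, beta2=0.999):
    g = grad_fn(theta)
    m = beta1 * m + (1 - beta1) * g
    u = np.maximum(beta2 * u, np.abs(g))   # u_t = max(beta2 * u_{t-1}, |g_t|), no bias correction needed
    m_hat = m / (1 - beta1 ** t)
    # In practice a tiny constant may be added to u to guard against division by zero.
    theta = theta - eta / u * m_hat
    return theta, m, u
```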

Nadam

As we have seen before, Adam can be viewed as a combination of RMSprop and momentum: RMSprop contributes the exponentially decaying average of past squared gradients vt, while momentum accounts for the exponentially decaying average of past gradients mt. We have also seen that Nesterov accelerated gradient (NAG) is superior to vanilla momentum.

Nadam (Nesterov-accelerated Adaptive Moment Estimation) [16] thus combines Adam and NAG. In order to incorporate NAG into Adam, we need to modify its momentum term mt.

First, let us recall the momentum update rule using our current notation:

g_t = \nabla_{\theta_t} J(\theta_t)
m_t = \gamma m_{t-1} + \eta g_t
\theta_{t+1} = \theta_t - m_t

where J is our objective function, γ is the momentum decay term, and η is our step size. Expanding the third equation above yields:

\theta_{t+1} = \theta_t - (\gamma m_{t-1} + \eta g_t)

This demonstrates again that momentum involves taking a step in the direction of the previous momentum vector and a step in the direction of the current gradient.

NAG then allows us to perform a more accurate step in the gradient direction by updating the parameters with the momentum step before computing the gradient. We thus only need to modify the gradient gt to arrive at NAG:

g_t = \nabla_{\theta_t} J(\theta_t - \gamma m_{t-1})
m_t = \gamma m_{t-1} + \eta g_t
\theta_{t+1} = \theta_t - m_t

Dozat proposes to modify NAG the following way: Rather than applying the momentum step twice -- one time for updating the gradient g_t and a second time for updating the parameters θ_{t+1} -- we now apply the look-ahead momentum vector directly to update the current parameters:

g_t = \nabla_{\theta_t} J(\theta_t)
m_t = \gamma m_{t-1} + \eta g_t
\theta_{t+1} = \theta_t - (\gamma m_t + \eta g_t)

Notice that rather than utilizing the previous momentum vector m_{t-1} as in the equation of the expanded momentum update rule above, we now use the current momentum vector m_t to look ahead. In order to add Nesterov momentum to Adam, we can thus similarly replace the previous momentum vector with the current momentum vector. First, recall that the Adam update rule is the following (note that we do not need to modify \hat{v}_t):

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t

Expanding the third equation with the definitions of \hat{m}_t and m_t in turn gives us:

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \left( \frac{\beta_1 m_{t-1}}{1 - \beta_1^t} + \frac{(1 - \beta_1) g_t}{1 - \beta_1^t} \right)

Note that \frac{m_{t-1}}{1 - \beta_1^t} is just the bias-corrected estimate of the momentum vector of the previous time step, so we can replace \frac{\beta_1 m_{t-1}}{1 - \beta_1^t} with \beta_1 \hat{m}_{t-1}:

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \left( \beta_1 \hat{m}_{t-1} + \frac{(1 - \beta_1) g_t}{1 - \beta_1^t} \right)

    Note that for simplicity, we ignore that the denominator is 1 - \beta_1^t and not 1 - \beta_1^{t-1}, as we will replace the denominator in the next step anyway. This equation again looks very similar to our expanded momentum update rule above. We can now add Nesterov momentum just as we did previously by simply replacing this bias-corrected estimate of the momentum vector of the previous time step \hat{m}_{t-1} with the bias-corrected estimate of the current momentum vector \hat{m}_t, which gives us the Nadam update rule:

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \left( \beta_1 \hat{m}_t + \frac{(1 - \beta_1) g_t}{1 - \beta_1^t} \right)
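
A minimal Nadam sketch that implements the update rule above; as before, grad_fn(theta) is a hypothetical gradient function and t is the 1-based time step:

```python
import numpy as np

def nadam_update(theta, m, v, t, grad_fn, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    g = grad_fn(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Look-ahead term: beta1 * m_hat plus the bias-corrected current gradient contribution.
    theta = theta - eta / (np.sqrt(v_hat) + eps) * (beta1 * m_hat + (1 - beta1) * g / (1 - beta1 ** t))
    return theta, m, v
```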
