Stochastic Gradient Descent

Stochastic gradient descent (SGD) methods are among the most widely used of all function approximation methods. Moreover, they are particularly well suited to online reinforcement learning. The article An overview of gradient descent optimization algorithms [2] gives a nice review of gradient descent methods.

Gradient descent methods assume that the approximating function (in our case the approximate value function $\hat{V}(s, \mathbf{w})$) is a differentiable function of $\mathbf{w}$ for all $s \in \mathbb{S}$.

Gradient descent methods are iterative algorithms, so we denote by $\mathbf{w}_t$ the weight vector at the $t$-th iteration. At each iteration we observe a new example $S_t \rightarrow V_{\pi}(S_t)$, i.e. a state $S_t$ together with its true value under the policy. The states may be selected at random, or they might be the successive states encountered during an interaction with the environment [1].

Hence, we have in hand $S_t$ and the corresponding state value $V_{\pi}(S_t)$. The difficulty we now face is that the number of weights is far smaller than the number of states, i.e. our function approximator has a rather limited resolution. In particular, there is generally no $\mathbf{w}$ that gets all the states, or even all the examples, exactly correct. A second difficulty is that the approximator must also generalize to states that have not appeared in any example yet [1].
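To make this concrete, here is a minimal sketch of such a low-resolution approximator, using state aggregation with hypothetical sizes (`n_states`, `n_groups`, and the feature construction are illustrative choices, not prescribed by [1]):

```python
import numpy as np

# Hypothetical sizes: 1000 states approximated with only 10 weights
# (state aggregation), so many states must share the same value.
n_states = 1000
n_groups = 10  # number of weights m, with m << |S|

def features(state: int) -> np.ndarray:
    """One-hot indicator of the group that the state belongs to."""
    x = np.zeros(n_groups)
    x[state * n_groups // n_states] = 1.0
    return x

def v_hat(state: int, w: np.ndarray) -> float:
    """Linear approximate value: V_hat(s, w) = w^T x(s)."""
    return features(state) @ w

w = np.zeros(n_groups)  # initial weight vector
```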

We assume that states appear in the examples with the same distribution, $\mu$, over which we are trying to minimize the mean squared value error, $MSVE$, given by:

$$MSVE(\mathbf{w}) = \sum_{s \in \mathbb{S}} \mu(s)\left[V_{\pi}(s) - \hat{V}(s, \mathbf{w})\right]^2$$
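When $\mu$ and $V_{\pi}$ are known, as in a small test problem, the $MSVE$ can be evaluated directly. A sketch continuing the example above, with a uniform $\mu$ and an arbitrary placeholder for $V_{\pi}$ (both are illustrative assumptions, not part of the text):

```python
def msve(w: np.ndarray, v_pi: np.ndarray, mu: np.ndarray) -> float:
    """MSVE(w) = sum_s mu(s) * (V_pi(s) - V_hat(s, w))**2."""
    return sum(mu[s] * (v_pi[s] - v_hat(s, w)) ** 2 for s in range(n_states))

mu = np.full(n_states, 1.0 / n_states)    # placeholder state distribution
v_pi = np.linspace(0.0, 1.0, n_states)    # placeholder "true" values
print(msve(w, v_pi, mu))
```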

Stochastic gradient descent (SGD) methods do this by adjusting the weight vector, after each example, by a small amount in the direction that would most reduce the error on that example. Namely,

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \frac{1}{2}\eta \nabla \left[V_{\pi}(S_t)-\hat{V}(S_t,\mathbf{w}_t) \right ]^2 = \mathbf{w}_t + \eta \left[V_{\pi}(S_t)-\hat{V}(S_t,\mathbf{w}_t) \right ] \nabla \hat{V}(S_t, \mathbf{w}_t)$$

where $\eta$ is a positive step-size parameter, also known as the learning rate, and

$$\nabla \hat{V}(S_t, \mathbf{w}) = \left(\frac{\partial \hat{V}(S_t, \mathbf{w})}{\partial w_1}, ..., \frac{\partial \hat{V}(S_t, \mathbf{w})}{\partial w_m}\right)$$
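For a linear approximator like the state-aggregation sketch above, the gradient $\nabla \hat{V}(S_t, \mathbf{w})$ is just the feature vector $\mathbf{x}(S_t)$, so the update rule reduces to a few lines of code. A sketch continuing that example (the step-size value 0.05 is an arbitrary illustration):

```python
def sgd_update(w: np.ndarray, state: int, target: float, eta: float = 0.05) -> np.ndarray:
    """w_{t+1} = w_t + eta * [target - V_hat(S_t, w_t)] * grad V_hat(S_t, w_t).

    For the linear case the gradient is simply the feature vector x(S_t).
    """
    x = features(state)
    error = target - x @ w
    return w + eta * error * x

# One update on the example S_t = 123 with target V_pi(S_t):
w = sgd_update(w, state=123, target=v_pi[123])
```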

SGD methods are gradient descent methods because they update $\mathbf{w}_t$ by a small amount in the direction that most reduces the error. The error metric we use here, the $MSVE$, measures the error on a single example by the quantity

$$e^2 = \left[V_{\pi}(S_t) - \hat{V}(S_t, \mathbf{w})\right]^2$$

This quantity is reduced most rapidly in the direction of its negative gradient with respect to the weights, $-\nabla e^2$, which is exactly the direction of the update above.
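Spelling out the gradient of $e^2$ with the chain rule makes the connection to the update rule explicit:

$$\nabla e^2 = 2\left[V_{\pi}(S_t) - \hat{V}(S_t, \mathbf{w})\right]\nabla\left[V_{\pi}(S_t) - \hat{V}(S_t, \mathbf{w})\right] = -2\left[V_{\pi}(S_t) - \hat{V}(S_t, \mathbf{w})\right]\nabla \hat{V}(S_t, \mathbf{w})$$

so that $-\frac{1}{2}\eta \nabla e^2 = \eta\left[V_{\pi}(S_t) - \hat{V}(S_t, \mathbf{w})\right]\nabla \hat{V}(S_t, \mathbf{w})$, which is precisely the update term in the rule above.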

There are various gradient descent methods; see for example [2]. Gradient descent methods are called stochastic when the update is done on only a single example, which might have been selected stochastically. Over many examples, and with small steps, the overall effect is to minimize an average performance measure such as the $MSVE$ [1].

SGD performs frequent updates with high variance, which causes the objective function to fluctuate heavily. This is shown in the image below.

Figure 1. SGD fluctuation. Image from [2].

This fluctuating behavior, in general, complicates convergence to the exact minimum. However, it has been shown that when the learning rate $\eta$ is slowly decreased, SGD exhibits the same convergence behaviour as batch gradient descent, almost surely converging to a local minimum for non-convex problems and to the global minimum for convex problems [2]. Another issue with SGD is that, because it updates the weights one example at a time, presenting the examples in a fixed, meaningful order may bias the optimization. This can be mitigated by shuffling the data after each epoch, as in the sketch below.
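A minimal sketch of these two mitigations, per-epoch shuffling and a slowly decreasing learning rate, continuing the example above (the schedule $\eta_k = \eta_0 / (1 + k)$ and the number of epochs are arbitrary illustrative choices):

```python
rng = np.random.default_rng(0)
eta0 = 0.5
states = np.arange(n_states)

for epoch in range(50):
    rng.shuffle(states)            # avoid presenting examples in a fixed, biasing order
    eta = eta0 / (1.0 + epoch)     # slowly decreasing learning rate
    for s in states:
        w = sgd_update(w, state=s, target=v_pi[s], eta=eta)
```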

Given that SGD makes frequent updates that may exhibit high variance, why don't we move all the way in the minimizing direction in one step and completely eliminate the error on the visited example? Although this can often be done, it is usually not the right thing to do. The reason is that the weights are far fewer than the states, so we should not seek a value function that has zero error for all states. Instead, the approximation should balance the errors across the different states [1]. If we completely corrected each example in one step, we would not find such a balance [1]. Note also that the convergence results for SGD methods assume that the learning rate, $\eta$, decreases over time. In particular, if it decreases in such a way as to satisfy the stochastic approximation conditions

$$\sum_{n = 1}^{\infty} \eta_n = \infty ~~ \text{and} ~~ \sum_{n = 1}^{\infty} \eta_{n}^2 < \infty $$

then the SGD method is guaranteed to converge to a local optimum [1].
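For example, the harmonic schedule $\eta_n = 1/n$ satisfies both conditions, since

$$\sum_{n = 1}^{\infty} \frac{1}{n} = \infty ~~ \text{and} ~~ \sum_{n = 1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty $$

whereas a constant learning rate satisfies the first condition but not the second.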

As a final note, observe that we need $V_{\pi}(S_t)$ in order to perform the SGD weight update. This may not always be available. We may, however, have in hand an approximation of it, call it $U_t$, i.e. $V_{\pi}(S_t) \approx U_t$. In this scenario we are forced to use the latter, replacing $V_{\pi}(S_t)$ with $U_t$ in the update rule. Nevertheless, if $U_t$ is an unbiased estimate, i.e. $\mathbb{E}\left[U_t \mid S_t\right] = V_{\pi}(S_t)$, then $\mathbf{w}_t$ is still guaranteed to converge to a local optimum under the conditions on the decreasing $\eta$ specified above [1].
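For example, the Monte Carlo return $G_t$ is an unbiased estimate of $V_{\pi}(S_t)$ and can play the role of $U_t$, which gives the gradient Monte Carlo method of [1]. A sketch of one such update pass, continuing the example above and assuming an episode of `(state, return)` pairs has been gathered elsewhere:

```python
def gradient_mc_update(w: np.ndarray, episode, eta: float) -> np.ndarray:
    """One pass over an episode of (S_t, G_t) pairs, using the return G_t
    as an unbiased stand-in U_t for the unknown V_pi(S_t)."""
    for state, g_return in episode:
        x = features(state)
        w = w + eta * (g_return - x @ w) * x
    return w

# Hypothetical usage with a hand-made episode:
# w = gradient_mc_update(w, episode=[(3, 0.7), (42, 0.1)], eta=0.01)
```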

References

  1. Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction.
  2. Sebastian Ruder, An overview of gradient descent optimization algorithms.