Linear Methods
Linear models for reinforcement learning
In this article, we will continue our function approximation journey by introducing linear models for representing the state value function $V_{\pi}(S_t)$. Recall that in the case of large state spaces it is advantageous to use some sort of parametric function approximation rather than a tabular representation of the state value function. Furthermore, we need a model in order to perform, for example, the SGD update step given below
$$\mathbf{w}_{t+1} = \mathbf{w}_t + \eta \left[V_{\pi}(S_t)-\hat{V}(S_t,\mathbf{w}_t) \right ] \nabla \hat{V}(S_t, \mathbf{w}_t)$$
One of the simplest representations we can have, and the one we adopt in this post, is a model that is linear with respect to the weights $\mathbf{w}$. In this case, $\hat{V}(s, \mathbf{w})$ becomes
$$\hat{V}(s, \mathbf{w}) = \sum_{i=1}^{d} w_i x_i(s) = \mathbf{w}^T \mathbf{x}(s)$$
The expression above implies that for every state $s$ there is a vector $\mathbf{x}(s)$ with the same number of components as the weight vector $\mathbf{w}$. The vector $\mathbf{x}(s)$ represents the features of the state $s$. For example, for an autonomous vehicle, $\mathbf{x}(s)$ may be a vector with components such as vehicle velocity, vehicle acceleration, vehicle position, vehicle orientation, gas level, etc. Technically, each component $x_i$ is a function $x_i: \mathbb{S}\rightarrow \mathbb{R}$ [1]. For linear methods, the features are basis functions because they form a linear basis for the set of approximate functions [1].
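As a minimal sketch of this representation, the linear value estimate is just a dot product between the weights and the features. The feature map below is purely hypothetical and only meant to illustrate the idea of $\mathbf{x}(s)$:

```python
import numpy as np

def features(state: int, d: int = 4) -> np.ndarray:
    """Hypothetical feature map x(s): maps an integer state to d hand-crafted features."""
    x = np.zeros(d)
    x[0] = 1.0                  # bias feature
    x[1] = float(state)         # raw state index
    x[2] = float(state) ** 2    # simple polynomial feature
    x[3] = np.sin(state)        # arbitrary nonlinear function of s
    return x

def v_hat(state: int, w: np.ndarray) -> float:
    """Linear approximation V_hat(s, w) = w^T x(s)."""
    return float(w @ features(state, d=w.shape[0]))

w = np.zeros(4)        # weight vector, one component per feature
print(v_hat(3, w))     # 0.0 before any learning
```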
Linear models have a straightforward gradient calculation. Indeed, since $\hat{V}(S_t, \mathbf{w}_t) = \mathbf{w}_t^T \mathbf{x}(S_t)$ is linear in the weights, the gradient with respect to $\mathbf{w}_t$ is simply the feature vector:
$$\nabla \hat{V}(S_t, \mathbf{w}_t) = \mathbf{x}(S_t)$$
Therefore, the SGD update step becomes
$$\mathbf{w}_{t+1}= \mathbf{w}_t + \eta \left[V_{\pi}(S_t)-\hat{V}(S_t,\mathbf{w}_t) \right ] \mathbf{x}(S_t)$$
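A minimal sketch of this update, assuming NumPy arrays for $\mathbf{w}$ and $\mathbf{x}(S_t)$:

```python
import numpy as np

def sgd_update(w: np.ndarray, target: float, x: np.ndarray, eta: float) -> np.ndarray:
    """One SGD step for the linear value model.

    w      -- current weight vector
    target -- V_pi(S_t), or an estimate U_t of it
    x      -- feature vector x(S_t)
    eta    -- step size
    """
    prediction = float(w @ x)                     # current prediction w^T x(S_t)
    return w + eta * (target - prediction) * x    # gradient of V_hat w.r.t. w is x(S_t)
```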
Linear models are, in general, very well understood throughout science. In our case, when using a linear model there is only one optimum (or, in degenerate cases, one set of equally good optima). Therefore, any method that is guaranteed to converge to or near a local optimum is automatically guaranteed to converge to or near the global optimum [1].
As a final note, observe that we need $V_{\pi}(S_t)$ in order to perform the SGD weight update. This may not always be available. We may, however, have in hand an approximation of it, let's call it $U_t$, i.e. $V_{\pi}(S_t) \approx U_t$. In this scenario, we are forced to use the latter. However, if $U_t$ is an unbiased estimate, then $\mathbf{w}_t$ is guaranteed to converge to a local optimum under the usual stochastic approximation conditions on the decreasing step size $\eta$.
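For instance, if the target $U_t$ is the Monte Carlo return $G_t$, which is an unbiased estimate of $V_{\pi}(S_t)$, the update can be wrapped in a simple prediction loop. The sketch below is hypothetical: it reuses the `features` and `sgd_update` functions from above and assumes each episode is stored as a list of $(S_t, R_{t+1})$ pairs.

```python
def gradient_mc_prediction(episodes, w, eta=0.01, gamma=1.0):
    """Gradient Monte Carlo prediction with a linear value function.

    episodes -- iterable of episodes, each a list of (state, reward) pairs,
                where reward is the reward received after leaving the state
    """
    for episode in episodes:
        G = 0.0
        # iterate backwards so that G accumulates the return G_t for each visited state
        for state, reward in reversed(episode):
            G = reward + gamma * G
            x = features(state, d=w.shape[0])
            w = sgd_update(w, target=G, x=x, eta=eta)   # use U_t = G_t as the target
    return w
```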
- [1] Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction.