|
bitrl & cuberl Documentation
Simulation engine for reinforcement learning agents
|
In this example we will apply the SARSA algorithm on CliffWorld. This is a Gymnasium-based environment implemented in bitrl::envs::gymnasium::CliffWorld class. All Gymnasium-based environments require an instance of the bitrl-envs-api server to run.
Temporal Difference Learning is a combination of dynamic programming and Monte Carlo. TDL does not use the environment transition probabilities and therefore is a model-free algorithm. One problem with TDL is that we don't know which action to take; TDL approximate that state-value function $V(s)$ however, we cannot use the Bellman equations to establish the state-action value function $Q(s, \alpha)$ as we don't know the transition probabilities. The solution thus is to approximate $Q(s, \alpha)$ directly. In this note we will discuss one such approach i.e. SARSA or State-Action-Reward-State-Action.
SARSA is an on-policy method i.e. the agent takes actions and learns, that is it chooses the next $Q(s,\alpha)$, under the policy $\pi$ it currently follows. When we update the value of $Q(s,\alpha)$ we take into account the value $Q(s_{t+1},\alpha_{t+1})$. Hence, with the SARSA algorithm we work with the following tuple $(s_t, \alpha_t, s_{t+1}, \alpha_{t+1})$.
Remark
Typically, the policy $\pi$ will an $\epsilon$-greedy policy but this need not be the case.
The following is the update formula that SARSA is using:
$$ Q(s_t, \alpha_t) = Q(s_t, \alpha_t) + \eta \left[r_{t+1} + \gamma Q(s_{t+1}, \alpha_{t+1}) - Q(s_t, \alpha_t)\right] $$
The state-action value function that is learnt by SARSA reflect a real policy that includes both exploration noise and operational constraints. The algorithm, compared to Q-learning, is more conservative and therefore convergence is slower.
Let's have a closer look into the algorithm.
Reinforcement learning algorithms will usually start by knowing nothing about the environment. So the first step is to initialize the table that represents $Q(s,\alpha)$ to arbitrary values; often this is just zero. This however, can also be values that encourage exploration.
The algorithm begins by some be presented with a state. SARSA needs to decide what to do whilst at this state. This is done using a policy $\pi$ which most often will be an $\epsilon-$greedy policy.
The algorithm will execute the action that was selected from step 2. The environment will respond with a reward $r_t$ and the new state $s_{t+1}$. However, we cannot update yet the table $Q(s,\alpha)$; we need to determine what we will do next. It uses the policy $\pi$ at the new state $s_{t+1}$ in order to select the new action $\alpha_{t+1}$. $\alpha_{t+1}$ is the action that the agent will take next. This is the key SARSA step that distinguishes it from Q-learning. [1]. We can now update the state-action value function. The update rule, see above, is using the future $Q(s_{t+1}, \alpha_{t+1})$ in order to update the current $Q(s_{t}, \alpha_{t})$
After updating, we move to $s_{t+1}$ and take action $\alpha_{t+1}$ and repeat step 3. If we've reached a terminal state (like the end of a game or a completed transaction), the episode ends and we start fresh. The environment has to inform the agent about whether it reached the end of the game or not. So when we take an action in the environment, we will usually receive not just a reward signal and the next state but also a flag indicating if the end of the game or simulation has been reached.
The SARSA algorithm is implemented in the cuberl::rl::algos::td::SarsaSolver class
Below is the driver code for this example.
|
|---|
| Figure 1:Playing CliffWorld step 1. |
|
|---|
| Figure 2:Playing CliffWorld step 3. |
|
|---|
| Figure 3:Playing CliffWorld step 5. |
|
|---|
| Figure 4:Playing CliffWorld step 11. |
|
|---|
| Figure 5:Playing CliffWorld step 14. |
|
|---|
| Figure 6:Playing CliffWorld step 17. |