|
bitrl & cuberl Documentation
Simulation engine for reinforcement learning agents
|
In this example, we will apply another temporal difference learning algorithm namely Q-learning on CliffWorld. This is a Gymnasium-based environment implemented in bitrl::envs::gymnasium::CliffWorld class. We used this environment in example CubeRL Example 13: SARSA on CliffWorld.
Q-learning, just like SARSA, is a model-free reinforcement learning algorithm that learns how good an action is in a given state, without needing a model of the environment. As the name suggests, the algorithms learns a state-action value function \(Q(s,\alpha)\). The algorithm is very similar to SARSA. Before we show the algorithmic steps, here is the update function used in Q-learning
$$ Q(s,\alpha) = Q(s,\alpha) + \eta \big[r_t + \gamma \max_{\alpha_{t+1}} Q(s_{t+1}, \alpha_{t+1}) - Q(s,\alpha)\big] $$
The algorithm just like SARSA looks into the future but always selects the action that provides the best $Q$. Contrast this with the update formula for SARSA:
$$ Q(s_t, \alpha_t) = Q(s_t, \alpha_t) + \eta \left[r_{t+1} + \gamma Q(s_{t+1}, \alpha_{t+1}) - Q(s_t, \alpha_t) \right] $$
So Q-learning instead of consulting \(\pi\) about \(\alpha_{t+1}\) it simply acts greedily and selects the action that provides the best state-action value function. This is called optimistic bootstrapping; assume the best possible action will be taken next, regardless of what we'll actually do. Here are the steps for Q-learning. These are very similar to SARSA:
The first step is to initialize the table that represents $Q(s,\alpha)$ to arbitrary values; often this is just zero. This however, can also be values that encourage exploration.
The algorithm begins by some be presented with a state. Q-learning needs to decide what to do whilst at this state. This is done using a policy $\pi$ which most often will be an \(\epsilon-\)greedy policy.
The algorithm will execute the action that was selected from step 2. The environment will respond with a reward \(r_t\) and the new state \(s_{t+1}\). In order to apply the update formula we need to calculate what is the best \(Q\) at \(s_{t+1}\). The update rule, see above, is using the future \(maxQ(s_{t+1}, \alpha_{t+1})\) in order to update the current \(Q(s_{t}, \alpha_{t})\) Compared to SARSA, in Q-learning we don't necessarily choose \(\alpha_{t+1}\) when at \(s_{t+1}\) but instead ask the policy to provide the action to.
After updating, we move to \(s_{t+1}\) and ask \(\pi\) to provide us with an action and we repeat step 3. If we've reached a terminal state (like the end of a game or a completed transaction), the episode ends and we start fresh. The environment has to inform the agent about whether it reached the end of the game or not. So when we take an action in the environment, we will usually receive not just a reward signal and the next state but also a flag indicating if the end of the game or simulation has been reached.
All in all, Q-learning learns the optimal policy. It is a fast and efficient algorithm that is ideal for batch learning. Q-learning, compared to SARSA, is more optimistic which can be dangerous in risky domains, and it suffers from maximization bias that can overestimate action values when estimates are noisy [2].
The Q-learning algorithm is implemented in the cuberl::rl::algos::td::QLearningSolver class
|
|---|
| Figure 1:Playing CliffWorld step 1. |
|
|---|
| Figure 2:Playing CliffWorld step 2. |
|
|---|
| Figure 3:Playing CliffWorld step 3. |
|
|---|
| Figure 4:Playing CliffWorld step 9. |
|
|---|
| Figure 5:Playing CliffWorld step 12. |
|
|---|
| Figure 6:Playing CliffWorld step 13. |