In this example we will look at how to solve the multi-armed bandits problem. In particular, we will look at two specific algorithms, namely \(\epsilon\)-greedy and Thompson sampling.
The multi-armed bandits framework has many applications in online decision-making, such as advertisement placement and recommendation engines. An algorithm that solves the problem essentially provides a policy, i.e. a rule for choosing amongst the available options, that optimizes a given criterion such as the accumulated reward.
Remark
The article Multi-Armed Bandits in Python: Epsilon Greedy, UCB1, Bayesian UCB, and EXP3 by James LeDoux provides a nice overview of \(\epsilon\)-greedy and Thompson sampling, whilst the book Bandit Algorithms for Website Optimization by John Myles White discusses bandit algorithms in the context of website optimization.
\(\epsilon\)-greedy sampling
Perhaps one of the simplest approaches to solving the bandits problem is the \(\epsilon\)-greedy sampling strategy. As the name suggests, most of the time the algorithm greedily selects the action with the highest accumulated reward. However, with a small probability \(\epsilon\), it selects an action at random amongst the available ones. Despite its simplicity, the algorithm often produces quite good results.
The \(\epsilon\)-greedy sampling is implemented in the class cuberl::rl::policies::EpsilonGreedyPolicy.
Thompson sampling
According to Wikipedia, Thompson sampling is an approach for selecting actions that addresses the exploration–exploitation dilemma in the multi-armed bandit problem. When using Thompson sampling, we choose the action that maximizes the expected reward with respect to a randomly drawn belief.
In the implementation below we will use the Beta distribution to model the belief over each lever, but this need not be the case. Feel free to experiment with this.
The class bitrl::envs::bandits::MultiArmedBandits conveniently wraps the multi-armed bandits framework into an environment we can utilise.
Driver code
#include "cubeai/base/cubeai_types.h"
#include "cubeai/maths/vector_math.h"
#include "cubeai/rl/policies/epsilon_greedy_policy.h"
// headers for the bandits environment and the Beta distribution;
// adjust the paths to match your installation of the library
#include "cubeai/envs/bandits/multi_armed_bandits.h"
#include "cubeai/maths/statistics/beta_dist.h"
#include <cmath>
#include <utility>
#include <tuple>
#include <iostream>
#include <random>
#include <algorithm>
#include <numeric>
#include <unordered_map>
#include <string>
#include <vector>
#include <any>
namespace example{
// the using declarations below follow the cubeai includes above;
// adjust the namespaces if your version of the library differs
using cubeai::real_t;
using cubeai::uint_t;
using cubeai::DynVec;
using cubeai::envs::bandits::MultiArmedBandits;
using cubeai::maths::stats::BetaDist;
using cubeai::maths::arg_max;
using cubeai::rl::policies::EpsilonGreedyPolicy;
const uint_t N_EPISODES = 1000;
const real_t EPS = 0.1;
struct SamplingResult
{
    real_t total_reward;
    DynVec<uint_t> lever_pulls;
};
SamplingResult
run_epsilon_greedy(MultiArmedBandits& env, real_t eps){
    SamplingResult result;
    result.total_reward = 0.0;
    result.lever_pulls.resize(env.n_actions());
    // accumulated reward per lever; used for the greedy selection
    DynVec<real_t> max_rewards(env.n_actions());
    for(uint_t a=0; a < env.n_actions(); ++a){
        max_rewards[a] = 0.0;
        result.lever_pulls[a] = 0;
    }
    EpsilonGreedyPolicy policy(eps);
    for(uint_t ep=0; ep < N_EPISODES; ++ep){
        // with probability eps choose a random lever, otherwise
        // choose the lever with the highest accumulated reward
        auto action = policy(max_rewards);
        auto time_step = env.step(action);
        auto reward = time_step.reward();
        max_rewards[action] += reward;
        result.total_reward += reward;
        result.lever_pulls[action] += 1;
    }
    return result;
}
SamplingResult
run_thompson_sampling(MultiArmedBandits& env){
    SamplingResult result;
    result.total_reward = 0.0;
    result.lever_pulls.resize(env.n_actions());
    for(uint_t a=0; a < env.n_actions(); ++a){
        result.lever_pulls[a] = 0;
    }
    // one Beta(1, 1) prior per lever
    std::vector<BetaDist<real_t>> beta_dists(env.n_actions(),
                                             BetaDist<real_t>(1.0, 1.0));
    std::vector<real_t> samples(env.n_actions(), 0.0);
    for(uint_t ep=0; ep < N_EPISODES; ++ep){
        // draw one sample from every posterior...
        for(uint_t i=0; i < beta_dists.size(); ++i){
            samples[i] = beta_dists[i].sample();
        }
        // ...and play the lever with the largest sample
        auto max_item = arg_max(samples);
        auto time_step = env.step(max_item);
        auto reward = time_step.reward();
        result.total_reward += reward;
        // posterior update of the chosen lever: a success
        // increments alpha, a failure increments beta
        auto alpha = beta_dists[max_item].alpha();
        auto beta = beta_dists[max_item].beta();
        if(reward == env.success_reward()){
            alpha += 1.0;
        }
        else{
            beta += 1.0;
        }
        beta_dists[max_item].reset(alpha, beta);
        result.lever_pulls[max_item] += 1;
    }
    return result;
}
}
int main(){
    using namespace example;
    {
        // scenario 1: all levers share the same probability of success
        std::vector<real_t> p(5, 0.5);
        std::unordered_map<std::string, std::any> options;
        options["p"] = std::any(p);
        // construct the environment; the exact construction call
        // may differ in your version of the library
        MultiArmedBandits env;
        env.make(options);
        std::cout<<"Running thompson sampling"<<std::endl;
        auto thompson_result = run_thompson_sampling(env);
        std::cout<<"Thompson reward: "<<thompson_result.total_reward<<std::endl;
        std::cout<<"Thompson selected levers: "<<thompson_result.lever_pulls<<std::endl;
        std::cout<<"Running epsilon greedy sampling"<<std::endl;
        auto epsilon_greedy_result = run_epsilon_greedy(env, EPS);
        std::cout<<"epsilon-greedy reward: "<<epsilon_greedy_result.total_reward<<std::endl;
        std::cout<<"epsilon-greedy selected levers: "<<epsilon_greedy_result.lever_pulls<<std::endl;
    }
    std::cout<<"========================================"<<std::endl;
    {
        // scenario 2: levers with increasing probability of success
        std::vector<real_t> p(5, 0.2);
        p[0] = 0.1;
        p[1] = 0.2;
        p[2] = 0.3;
        p[3] = 0.4;
        p[4] = 0.5;
        std::unordered_map<std::string, std::any> options;
        options["p"] = std::any(p);
        MultiArmedBandits env;
        env.make(options);
        std::cout<<"Running thompson sampling"<<std::endl;
        auto thompson_result = run_thompson_sampling(env);
        std::cout<<"Thompson reward: "<<thompson_result.total_reward<<std::endl;
        std::cout<<"Thompson selected levers: "<<thompson_result.lever_pulls<<std::endl;
        std::cout<<"Running epsilon greedy sampling"<<std::endl;
        auto epsilon_greedy_result = run_epsilon_greedy(env, EPS);
        std::cout<<"epsilon-greedy reward: "<<epsilon_greedy_result.total_reward<<std::endl;
        std::cout<<"epsilon-greedy selected levers: "<<epsilon_greedy_result.lever_pulls<<std::endl;
    }
    return 0;
}
Running the driver above produces the following output (this may differ on your machine):
Running thompson sampling
Thompson reward: 484
Thompson selected levers: 306 158 256 250 30
Running epsilon greedy sampling
epsilon-greedy reward: 486
epsilon-greedy selected levers: 840 49 36 34 41
========================================
Running thompson sampling
Thompson reward: 447
Thompson selected levers: 19 9 41 229 702
Running epsilon greedy sampling
epsilon-greedy reward: 140
epsilon-greedy selected levers: 840 36 45 37 42
We can see that when all levers have the same probability of success, Thompson sampling and \(\epsilon\)-greedy perform more or less the same, with \(\epsilon\)-greedy doing marginally better. Notice also that Thompson sampling spreads the pulls amongst the equally good levers. However, when the levers have different probabilities of success, Thompson sampling outperforms \(\epsilon\)-greedy by roughly a factor of three. Also notice that Thompson sampling favours the last lever, as one would expect. In contrast, \(\epsilon\)-greedy happens to lock onto the lever with the lowest success probability.
Although we did not implement it here, we can decay \(\epsilon\) over time. This makes sense since, as the agent spends more time in the game, it acquires more knowledge and therefore does not need to explore as often as in the early stages. Alternatively, we can have the agent act completely at random at the beginning and then fully exploit.