bitrl & cuberl Documentation
Simulation engine for reinforcement learning agents
bitrl Namespace Reference

Namespaces

namespace  boards
 
namespace  consts
 
namespace  control
 
namespace  dynamics
 
namespace  envs
 
namespace  estimation
 
namespace  network
 
namespace  rigid_bodies
 
namespace  sensors
 
namespace  utils
 

Classes

struct  ActiveBoundaryObject
 
class  FilteredIterator
 Simple wrapper to boost::filter_iterator. More...
 
struct  IntegralRange
 A range of integer values in [s, e]. More...
 
struct  IsActive
 
struct  NotNull
 
struct  Null
 Null placeholder. More...
 
struct  RealRange
 A range of double precision floating point values. More...
 
class  TimeStep
 Forward declaration. More...
 
struct  TimeStepEnumUtils
 Utilities for TimeStepTp. More...
 
class  VectorTimeStep
 Forward declaration. More...
 

Typedefs

typedef double real_t
 double precision real type
 
typedef float float_t
 float
 
typedef int int_t
 integer type
 
typedef long int lint_t
 long int type
 
typedef std::size_t uint_t
 unsigned integer type (std::size_t)
 
template<typename T >
using DynMat = Eigen::MatrixX< T >
 Dynamically sized matrix to use around the library.
 
template<typename T , uint_t N>
using SquareMat = Eigen::Matrix< T, N, N >
 Square matrix with elements of type T.
 
template<typename T , uint_t N, uint_t M>
using Mat = Eigen::Matrix< T, N, M >
 General fixed size matrix.
 
using RealMat3d = Eigen::Matrix3< real_t >
 3×3 matrix of type double.
 
using FloatMat3d = Eigen::Matrix3< float_t >
 3×3 matrix of type float.
 
template<typename T >
using DynVec = Eigen::RowVectorX< T >
 Dynamically sized row vector.
 
using FloatVec = DynVec< float_t >
 single precision floating point vector
 
using RealVec = DynVec< real_t >
 double precision floating point vector
 
using STD_FloatVec = std::vector< float_t >
 single precision std::vector
 
using STD_RealVec = std::vector< real_t >
 double precision std::vector
 
template<typename T >
using ColVec = Eigen::VectorX< T >
 Column vector. Some maths operations are easier using column vectors rather than DynVec.
 
using RealColVec = ColVec< real_t >
 Dynamically sized column vector.
 
using FoatColVec = ColVec< float_t >
 Dynamically sized column vector.
 
using RealColVec3d = Eigen::Vector3d
 3D column vector
 
using FloatColVec3d = Eigen::Vector3f
 3D column vector
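The scalar and container aliases listed above can be exercised with a minimal sketch. The snippet below redeclares a few of them locally so it depends only on the standard library; the Eigen-based aliases (DynMat, DynVec, and friends) are omitted to keep it dependency-free:

```cpp
#include <cstddef>
#include <vector>

// Local sketch mirroring bitrl::real_t, bitrl::uint_t and bitrl::STD_RealVec.
// These are redeclarations for illustration, not the library's own headers.
typedef double real_t;
typedef std::size_t uint_t;
using STD_RealVec = std::vector<real_t>;

// Sum the entries of a double precision vector.
real_t sum(const STD_RealVec& v) {
    real_t s = 0.0;
    for (uint_t i = 0; i < v.size(); ++i) s += v[i];
    return s;
}
```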
 

Enumerations

enum class  DeviceType { INVALID_TYPE = 0 , CPU = 1 , GPU = 2 }
 Enumeration of various device types. More...
 
enum class  TimeStepTp : uint_t { FIRST = 0 , MID = 1 , LAST = 2 , INVALID_TYPE = 3 }
 The TimeStepTp enum. More...
 

Functions

std::ostream & operator<< (std::ostream &out, const Null &)
 
template<typename StateTp >
std::ostream & operator<< (std::ostream &out, const TimeStep< StateTp > &step)
 
template<typename T >
std::ostream & operator<< (std::ostream &out, const TimeStep< std::vector< T > > &step)
 
template<typename StateTp >
std::ostream & operator<< (std::ostream &out, const VectorTimeStep< StateTp > &step)
 
template<typename Pred , typename Type >
bool operator== (const FilteredIterator< Pred, Type > &lhs, const FilteredIterator< Pred, Type > &rhs)
 
template<typename Pred , typename Type >
bool operator!= (const FilteredIterator< Pred, Type > &lhs, const FilteredIterator< Pred, Type > &rhs)
 
std::ostream & operator<< (std::ostream &out, const std::chrono::system_clock::time_point &tp)
 
template<typename T >
std::ostream & operator<< (std::ostream &out, const std::vector< T > &obs)
 

Detailed Description

Implements the Gridworld environment from the book Deep Reinforcement Learning in Action, published by Manning. You can find the original environment here: https://github.com/DeepReinforcementLearning/DeepReinforcementLearningInAction

Description

The Acrobot environment is based on Sutton's work in "Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding" and Sutton and Barto's book. The system consists of two links connected linearly to form a chain, with one end of the chain fixed. The joint between the two links is actuated. The goal is to apply torques on the actuated joint to swing the free end of the linear chain above a given height while starting from the initial state of hanging downwards.

As seen in the GIF: two blue links connected by two green joints. The joint between the two links is actuated. The goal is to swing the free end of the outer link to reach the target height (the black horizontal line above the system) by applying torque on the actuator.

Action Space

The action is discrete, deterministic, and represents the torque applied on the actuated joint between the two links.

Num Action Unit
0 apply -1 torque to the actuated joint torque (N m)
1 apply 0 torque to the actuated joint torque (N m)
2 apply 1 torque to the actuated joint torque (N m)

Observation Space

The observation is a ndarray with shape (6,) that provides information about the two rotational joint angles as well as their angular velocities:

Num Observation Min Max
0 Cosine of theta1 -1 1
1 Sine of theta1 -1 1
2 Cosine of theta2 -1 1
3 Sine of theta2 -1 1
4 Angular velocity of theta1 ~ -12.567 (-4 * pi) ~ 12.567 (4 * pi)
5 Angular velocity of theta2 ~ -28.274 (-9 * pi) ~ 28.274 (9 * pi)

where

  • theta1 is the angle of the first joint, where an angle of 0 indicates the first link is pointing directly downwards.
  • theta2 is relative to the angle of the first link. An angle of 0 corresponds to having the same angle between the two links.

The angular velocities of theta1 and theta2 are bounded at ±4π, and ±9π rad/s respectively. A state of [1, 0, 1, 0, ..., ...] indicates that both links are pointing downwards.
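The mapping from the joint angles and angular velocities to the 6-element observation can be sketched as follows. `acrobot_observation` is a hypothetical helper written for illustration, not part of bitrl's API:

```cpp
#include <array>
#include <cmath>

// Pack the Acrobot state (theta1, theta2 and their angular velocities)
// into the 6-element observation described above:
// [cos(theta1), sin(theta1), cos(theta2), sin(theta2), theta1_dot, theta2_dot]
std::array<double, 6> acrobot_observation(double theta1, double theta2,
                                          double theta1_dot, double theta2_dot) {
    std::array<double, 6> obs{{std::cos(theta1), std::sin(theta1),
                               std::cos(theta2), std::sin(theta2),
                               theta1_dot, theta2_dot}};
    return obs;
}
```

For theta1 = theta2 = 0 with zero velocities this yields [1, 0, 1, 0, 0, 0], the "both links pointing downwards" state mentioned above.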

Rewards

The goal is to have the free end reach a designated target height in as few steps as possible, and as such all steps that do not reach the goal incur a reward of -1. Achieving the target height results in termination with a reward of 0. The reward threshold is -100.

Starting State

Each parameter in the underlying state (theta1, theta2, and the two angular velocities) is initialized uniformly between -0.1 and 0.1. This means both links are pointing downwards with some initial stochasticity.

Episode End

The episode ends if one of the following occurs:

  1. Termination: The free end reaches the target height, which is constructed as: -cos(theta1) - cos(theta2 + theta1) > 1.0
  2. Truncation: Episode length is greater than 500 (200 for v0)
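The termination inequality above can be checked directly. `acrobot_terminated` is a hypothetical helper illustrating the condition, not a bitrl function:

```cpp
#include <cmath>

// Termination test for Acrobot as stated above: the free end is above the
// target height when -cos(theta1) - cos(theta2 + theta1) > 1.0
bool acrobot_terminated(double theta1, double theta2) {
    return -std::cos(theta1) - std::cos(theta2 + theta1) > 1.0;
}
```

Both links hanging down (theta1 = theta2 = 0) gives -2, well below the threshold; both links pointing straight up (theta1 = pi, theta2 = 0) gives +2 and terminates.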

Arguments

No additional arguments are currently supported.

env = gym.make('Acrobot-v1')

By default, the dynamics of the acrobot follow those described in Sutton and Barto's book Reinforcement Learning: An Introduction. However, a book_or_nips parameter can be modified to change the pendulum dynamics to those described in the original NeurIPS paper.

# To change the dynamics as described above
env.env.book_or_nips = 'nips'

See the following note and the implementation for details:

The dynamics equations were missing some terms in the NIPS paper which are present in the book. R. Sutton confirmed in personal correspondence that the experimental results shown in the paper and the book were generated with the equations shown in the book. However, there is the option to run the domain with the paper equations by setting book_or_nips = 'nips'.

Version History

  • v1: Maximum number of steps increased from 200 to 500. The observation space for v0 provided direct readings of theta1 and theta2 in radians, having a range of [-pi, pi]. The v1 observation space as described here provides the sine and cosine of each angle instead.
  • v0: Initial version released (1.0.0) (removed from gym for v1)

References

  • Sutton, R. S. (1996). Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding. In D. Touretzky, M. C. Mozer, & M. Hasselmo (Eds.), Advances in Neural Information Processing Systems (Vol. 8). MIT Press. https://proceedings.neurips.cc/paper/1995/file/8f1d43620bc6bb580df6e80b0dc05c48-Paper.pdf
  • Sutton, R. S., Barto, A. G. (2018). Reinforcement Learning: An Introduction. The MIT Press.

CartPole environment. The original environment is described here: https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py

Observation

Type: Box(4)

Num Observation Min Max
0 Cart Position -2.4 2.4
1 Cart Velocity -Inf Inf
2 Pole Angle -0.209 rad (-12 deg) 0.209 rad (12 deg)
3 Pole Angular Velocity -Inf Inf

Actions

Type: Discrete(2)

Num Action
0 Push cart to the left
1 Push cart to the right

Note: The amount by which the velocity is reduced or increased is not fixed; it depends on the angle the pole is pointing. This is because the center of gravity of the pole increases the amount of energy needed to move the cart underneath it.

Reward

Reward is 1 for every step taken, including the termination step.

Starting State

All observations are assigned a uniform random value in [-0.05, 0.05].

Episode Termination

  • Pole angle is more than 12 degrees.
  • Cart position is more than 2.4 (center of the cart reaches the edge of the display).
  • Episode length is greater than 200.

Solved Requirements

Considered solved when the average return is greater than or equal to 195.0 over 100 consecutive trials.

Pendulum environment. The original environment is described here: https://github.com/openai/gym/blob/master/gym/envs/classic_control/pendulum.py

Description

The inverted pendulum swingup problem is based on the classic problem in control theory. The system consists of a pendulum attached at one end to a fixed point, with the other end being free. The pendulum starts in a random position and the goal is to apply torque on the free end to swing it into an upright position, with its center of gravity right above the fixed point. The diagram (diagrams/pendulum.png, "Pendulum Coordinate System") specifies the coordinate system used for the implementation of the pendulum's dynamic equations.

  • x-y: cartesian coordinates of the pendulum's end in meters.
  • theta: angle in radians.
  • tau: torque in N m. Defined as positive counter-clockwise.

Action Space

The action is a ndarray with shape (1,) representing the torque applied to the free end of the pendulum.

Num Action Min Max
0 Torque -2.0 2.0

Observation Space

The observation is a ndarray with shape (3,) representing the x-y coordinates of the pendulum's free end and its angular velocity.

Num Observation Min Max
0 x = cos(theta) -1.0 1.0
1 y = sin(theta) -1.0 1.0
2 Angular Velocity -8.0 8.0

Rewards

The reward function is defined as:

r = -(theta^2 + 0.1 * theta_dt^2 + 0.001 * torque^2)

where theta is the pendulum's angle normalized between [-pi, pi] (with 0 being the upright position). Based on the above equation, the minimum reward that can be obtained is -(pi^2 + 0.1 * 8^2 + 0.001 * 2^2) = -16.2736044, while the maximum reward is zero (pendulum is upright with zero velocity and no torque applied).
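The reward computation above can be sketched as follows. Both function names are illustrative, not bitrl API; `angle_normalize` mirrors the wrap-to-[-pi, pi] helper used in the Gym implementation:

```cpp
#include <cmath>

// Wrap an angle into [-pi, pi].
double angle_normalize(double theta) {
    const double pi = std::acos(-1.0);
    double a = std::fmod(theta + pi, 2.0 * pi);
    if (a < 0.0) a += 2.0 * pi;
    return a - pi;
}

// r = -(theta^2 + 0.1 * theta_dot^2 + 0.001 * torque^2),
// with theta normalized so that 0 is the upright position.
double pendulum_reward(double theta, double theta_dot, double torque) {
    const double th = angle_normalize(theta);
    return -(th * th + 0.1 * theta_dot * theta_dot + 0.001 * torque * torque);
}
```

Upright at rest with no torque gives reward 0; the worst case (theta = pi, theta_dot = 8, torque = 2) gives the -16.2736044 minimum quoted above.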

Starting State

The starting state is a random angle in [-pi, pi] and a random angular velocity in [-1,1].

Episode Truncation

The episode truncates at 200 time steps.

Arguments

  • g: acceleration of gravity measured in (m/s^2), used to calculate the pendulum dynamics. The default value is g = 10.0.

gym.make('Pendulum-v1', g=9.81)

Version History

  • v1: Simplify the math equations, no difference in behavior.
  • v0: Initial version released (1.0.0).

Vector Acrobot environment. This class simply wraps copies of the Acrobot class. See: https://github.com/pockerman/rlenvs_from_cpp/blob/master/src/rlenvs/envs/gymnasium/classic_control/acrobot_env.h for more information

Base class for Gymnasium vector environments. See: https://gymnasium.farama.org/api/vector/sync_vector_env/

BlackJack environment https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/toy_text/blackjack.py

This is a simple implementation of the Gridworld Cliff reinforcement learning task.

Description

The board is a 4x12 matrix, with (using NumPy matrix indexing):

  • [3, 0] as the start at bottom-left
  • [3, 11] as the goal at bottom-right
  • [3, 1..10] as the cliff at bottom-center

If the agent steps on the cliff it returns to the start. An episode terminates when the agent reaches the goal.

Actions

There are 4 discrete deterministic actions:

  • 0: move up
  • 1: move right
  • 2: move down
  • 3: move left

Observations

There are 3x12 + 1 possible states. The agent cannot be at the cliff, nor at the goal (as either results in the end of the episode), so the reachable states are all positions of the first 3 rows plus the bottom-left cell. The observation is simply the current position encoded as a flattened index.

Reward

Each time step incurs -1 reward, and stepping into the cliff incurs -100 reward.
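The flattened-index encoding and reward schedule above can be sketched as follows. These are hypothetical helpers assuming NumPy-style [row, col] indexing on the 4x12 board, not bitrl functions:

```cpp
// Observation: the current position encoded as row * 12 + col on the 4x12 board.
int cliff_state(int row, int col) { return row * 12 + col; }

// The cliff occupies the bottom-center cells [3, 1..10].
bool is_cliff(int row, int col) { return row == 3 && col >= 1 && col <= 10; }

// Reward for landing on a cell: -100 for the cliff, -1 otherwise.
int cliff_reward(int row, int col) { return is_cliff(row, col) ? -100 : -1; }
```

For example, the start [3, 0] encodes to state 36 and the goal [3, 11] to state 47.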

Wrapper to the FrozenLake OpenAI-Gym environment. The original environment can be found at: https://github.com/openai/gym/blob/master/gym/envs/toy_text/frozen_lake.py

Frozen lake involves crossing a frozen lake from start (S) to goal (G) without falling into any holes (H). The agent may not always move in the intended direction due to the slippery nature of the frozen lake.

The agent takes a 1-element vector for actions. The action space is (dir), where dir decides the direction to move, which can be:

  • 0: LEFT
  • 1: DOWN
  • 2: RIGHT
  • 3: UP

The observation is a value representing the agent's current position, computed as current_row * nrows + current_col.

Reward schedule:

  • Reach goal(G): +1
  • Reach hole(H): 0

Arguments

gym.make('FrozenLake-v0', desc=None,map_name="4x4", is_slippery=True)

  • desc: Used to specify a custom map for frozen lake. For example, desc=["SFFF", "FHFH", "FFFH", "HFFG"].
  • map_name: ID to use any of the preloaded maps.

    "4x4": [ "SFFF", "FHFH", "FFFH", "HFFG" ]
    "8x8": [ "SFFFFFFF", "FFFFFFFF", "FFFHFFFF", "FFFFFHFF", "FFFHFFFF", "FHHFFFHF", "FHFFHFHF", "FFFHFFFG" ]

  • is_slippery: True/False. If True, the agent will move in the intended direction with probability 1/3, else will move in either perpendicular direction with equal probability of 1/3 in both directions. For example, if action is left and is_slippery is True, then:

  • P(move left)=1/3
  • P(move up)=1/3
  • P(move down)=1/3
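The slippery transition rule can be sketched as below. `slippery_action` is a hypothetical helper that picks among the intended action and its two perpendicular neighbours with equal probability; with actions numbered 0=LEFT, 1=DOWN, 2=RIGHT, 3=UP, the perpendicular directions of action a are (a-1) mod 4 and (a+1) mod 4:

```cpp
#include <random>

// With is_slippery == true, the intended action and its two perpendicular
// neighbours are each taken with probability 1/3.
int slippery_action(int intended, std::mt19937& rng) {
    std::uniform_int_distribution<int> pick(-1, 1);  // -1, 0, +1 equally likely
    return (intended + pick(rng) + 4) % 4;
}
```

For intended = 0 (LEFT) this yields UP (3), LEFT (0) or DOWN (1), matching the probabilities listed above.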

Typedef Documentation

◆ ColVec

template<typename T >
using bitrl::ColVec = typedef Eigen::VectorX<T>

Column vector. Some maths operations are easier using column vectors rather than DynVec.

◆ DynMat

template<typename T >
using bitrl::DynMat = typedef Eigen::MatrixX<T>

Dynamically sized matrix to use around the library.

◆ DynVec

template<typename T >
using bitrl::DynVec = typedef Eigen::RowVectorX<T>

Dynamically sized row vector.

◆ float_t

typedef float bitrl::float_t

float

◆ FloatColVec3d

using bitrl::FloatColVec3d = typedef Eigen::Vector3f

3D column vector

◆ FloatMat3d

using bitrl::FloatMat3d = typedef Eigen::Matrix3<float_t>

3×3 matrix of type float.

◆ FloatVec

using bitrl::FloatVec = typedef DynVec<float_t>

single precision floating point vector

◆ FoatColVec

using bitrl::FoatColVec = typedef ColVec<float_t>

Dynamically sized column vector.

◆ int_t

typedef int bitrl::int_t

integer type

◆ lint_t

typedef long int bitrl::lint_t

long int type

◆ Mat

template<typename T , uint_t N, uint_t M>
using bitrl::Mat = typedef Eigen::Matrix<T, N, M>

General fixed size matrix.

◆ real_t

typedef double bitrl::real_t

double precision real type

◆ RealColVec

using bitrl::RealColVec = typedef ColVec<real_t>

Dynamically sized column vector.

◆ RealColVec3d

using bitrl::RealColVec3d = typedef Eigen::Vector3d

3D column vector

◆ RealMat3d

using bitrl::RealMat3d = typedef Eigen::Matrix3<real_t>

3×3 matrix of type double.

◆ RealVec

using bitrl::RealVec = typedef DynVec<real_t>

double precision floating point vector

◆ SquareMat

template<typename T , uint_t N>
using bitrl::SquareMat = typedef Eigen::Matrix<T, N, N>

Square matrix with elements of type T.

◆ STD_FloatVec

using bitrl::STD_FloatVec = typedef std::vector<float_t>

single precision std::vector

◆ STD_RealVec

using bitrl::STD_RealVec = typedef std::vector<real_t>

double precision std::vector

◆ uint_t

typedef std::size_t bitrl::uint_t

unsigned integer type (std::size_t)

Enumeration Type Documentation

◆ DeviceType

enum class bitrl::DeviceType
strong

Enumeration of various device types.

Enumerator
INVALID_TYPE 
CPU 
GPU 

◆ TimeStepTp

enum class bitrl::TimeStepTp : uint_t
strong

The TimeStepTp enum.

Enumerator
FIRST 
MID 
LAST 
INVALID_TYPE 

Function Documentation

◆ operator!=()

template<typename Pred , typename Type >
bool bitrl::operator!= ( const FilteredIterator< Pred, Type > &  lhs,
const FilteredIterator< Pred, Type > &  rhs 
)
inline

◆ operator<<() [1/6]

std::ostream & bitrl::operator<< ( std::ostream &  out,
const Null  
)
inline

◆ operator<<() [2/6]

std::ostream & bitrl::operator<< ( std::ostream &  out,
const std::chrono::system_clock::time_point &  tp 
)
inline

◆ operator<<() [3/6]

template<typename T >
std::ostream & bitrl::operator<< ( std::ostream &  out,
const std::vector< T > &  obs 
)
Template Parameters
TThe type of the value to print
Parameters
outThe stream to write on
obsThe values to print on out
Returns
Read/write reference to the stream

◆ operator<<() [4/6]

template<typename StateTp >
std::ostream & bitrl::operator<< ( std::ostream &  out,
const TimeStep< StateTp > &  step 
)
inline

◆ operator<<() [5/6]

template<typename T >
std::ostream & bitrl::operator<< ( std::ostream &  out,
const TimeStep< std::vector< T > > &  step 
)

◆ operator<<() [6/6]

template<typename StateTp >
std::ostream & bitrl::operator<< ( std::ostream &  out,
const VectorTimeStep< StateTp > &  step 
)
inline

◆ operator==()

template<typename Pred , typename Type >
bool bitrl::operator== ( const FilteredIterator< Pred, Type > &  lhs,
const FilteredIterator< Pred, Type > &  rhs 
)
inline