Reinforcement Learning
Reinforcement learning is a powerful technique for learning when you have access to a simulator. Suppose that you have a high-fidelity way of predicting the outcome of an experiment, such as a physics engine or a chemistry engine, and you would like to solve some task within it. You can use reinforcement learning for this purpose.
Environments

class Environment(state_shape, n_actions=None, state_dtype=None, action_shape=None)
An environment in which an actor performs actions to accomplish a task.
An environment has a current state, which is represented as either a single NumPy array, or optionally a list of NumPy arrays. When an action is taken, that causes the state to be updated. The environment also computes a reward for each action, and reports when the task has been terminated (meaning that no more actions may be taken).
Two types of actions are supported. For environments with discrete action spaces, the action is an integer specifying the index of the action to perform (out of a fixed list of possible actions). For environments with continuous action spaces, the action is a NumPy array.
Environment objects should be written to support pickle and deepcopy operations. Many algorithms involve creating multiple copies of the Environment, possibly running in different processes or even on different computers.
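As a rough illustration of the interface described above, the following sketch defines a toy environment in plain Python. The class name CountdownEnvironment and its countdown task are hypothetical. It deliberately does not subclass the library's Environment, and a plain list stands in for the NumPy state array so the sketch has no dependencies. It also shows why deepcopy support matters: a copy can be stepped independently of the original.

```python
import copy


class CountdownEnvironment:
    """Hypothetical toy environment that mimics the interface described above."""

    def __init__(self, start=3):
        self.start = start
        self.n_actions = 2        # discrete actions: 0 = wait, 1 = decrement
        self.state_shape = (1,)
        self._state = None        # undefined until reset() is called
        self._terminated = None

    @property
    def state(self):
        return self._state

    @property
    def terminated(self):
        return self._terminated

    def reset(self):
        self._state = [float(self.start)]
        self._terminated = False

    def step(self, action):
        if action == 1:
            self._state = [self._state[0] - 1.0]
        self._terminated = self._state[0] <= 0.0
        # Reward is earned only on the step that reaches zero.
        return 1.0 if self._terminated else 0.0


env = CountdownEnvironment(start=2)
env.reset()
env.step(1)

# A deep copy carries its own state, so stepping it leaves the original alone.
clone = copy.deepcopy(env)
clone_reward = clone.step(1)
```

Keeping all mutable state in ordinary attributes, as here, is one simple way to make an environment copyable without extra work.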

__init__(state_shape, n_actions=None, state_dtype=None, action_shape=None)
Subclasses should call the superclass constructor in addition to doing their own initialization.
A value should be provided for either n_actions (for discrete action spaces) or action_shape (for continuous action spaces), but not both.
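The "one or the other, but not both" rule can be expressed as a small check. This is only a sketch: action_space_kind is a hypothetical helper, and nothing here implies that the library's constructor performs this validation itself.

```python
def action_space_kind(n_actions=None, action_shape=None):
    """Return "discrete" or "continuous", requiring exactly one argument."""
    # Both None, or both given, violates the rule stated above.
    if (n_actions is None) == (action_shape is None):
        raise ValueError("Provide exactly one of n_actions or action_shape")
    return "discrete" if n_actions is not None else "continuous"


kind_discrete = action_space_kind(n_actions=4)
kind_continuous = action_space_kind(action_shape=(3,))
```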
 Parameters
state_shape (tuple or list of tuples) – the shape(s) of the array(s) making up the state
n_actions (int) – the number of discrete actions that can be performed. If the action space is continuous, this should be None.
state_dtype (dtype or list of dtypes) – the type(s) of the array(s) making up the state. If this is None, all arrays are assumed to be float32.
action_shape (tuple) – the shape of the array describing an action. If the action space is discrete, this should be None.

property state
The current state of the environment, represented as either a NumPy array or list of arrays.
If reset() has not yet been called at least once, this is undefined.

property terminated
Whether the task has reached its end.
If reset() has not yet been called at least once, this is undefined.

property state_shape
The shape of the arrays that describe a state.
If the state is a single array, this returns a tuple giving the shape of that array. If the state is a list of arrays, this returns a list of tuples where each tuple is the shape of one array.

property state_dtype
The dtypes of the arrays that describe a state.
If the state is a single array, this returns the dtype of that array. If the state is a list of arrays, this returns a list containing the dtypes of the arrays.

property n_actions
The number of possible actions that can be performed in this Environment.
If the environment uses a continuous action space, this returns None.

property action_shape
The expected shape of NumPy arrays representing actions.
If the environment uses a discrete action space, this returns None.

reset()
Initialize the environment in preparation for doing calculations with it.
This must be called before calling step() or querying the state. You can call it again later to reset the environment back to its original state.

step(action)
Take a time step by performing an action.
This causes the “state” and “terminated” properties to be updated.
 Parameters
action (object) – an object describing the action to take
 Returns
the reward earned by taking the action, represented as a floating point number
(higher values are better)
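The reset(), step(), and terminated members are typically used together in a rollout loop. The sketch below assumes a hypothetical LineWalk environment and a hypothetical run_rollout helper. Neither is part of the library, and plain attributes and lists are used in place of properties and NumPy arrays to keep it short.

```python
class LineWalk:
    """Hypothetical environment: walk right along a line until `length`."""

    def __init__(self, length=3):
        self.length = length
        self.n_actions = 2      # 0 = stay, 1 = step right
        self.state = None
        self.terminated = None

    def reset(self):
        self.state = [0.0]
        self.terminated = False

    def step(self, action):
        self.state = [self.state[0] + float(action)]
        self.terminated = self.state[0] >= self.length
        # Small penalty per step, reward of 1.0 on reaching the end.
        return 1.0 if self.terminated else -0.1


def run_rollout(env, choose_action, max_rollout_length=20):
    """Run one rollout and return (total reward, number of steps taken)."""
    env.reset()                 # must be called before step() or reading state
    total = 0.0
    steps = 0
    while not env.terminated and steps < max_rollout_length:
        total += env.step(choose_action(env.state))
        steps += 1
    return total, steps


total_reward, n_steps = run_rollout(LineWalk(length=3), lambda state: 1)
```

The max_rollout_length cap mirrors the idea that a rollout may be cut off before the task terminates.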


class GymEnvironment(name)
This is a convenience class for working with environments from OpenAI Gym.

__init__(name)
Create an Environment wrapping the OpenAI Gym environment with a specified name.

reset()
Initialize the environment in preparation for doing calculations with it.
This must be called before calling step() or querying the state. You can call it again later to reset the environment back to its original state.

step(action)
Take a time step by performing an action.
This causes the “state” and “terminated” properties to be updated.
 Parameters
action (object) – an object describing the action to take
 Returns
the reward earned by taking the action, represented as a floating point number
(higher values are better)

Policies

class Policy(output_names, rnn_initial_states=[])
A policy for taking actions within an environment.
A policy is defined by a tf.keras.Model that takes the current state as input and performs the necessary calculations. There are many algorithms for reinforcement learning, and they differ in what values they require a policy to compute. That makes it impossible to define a single interface allowing any policy to be optimized with any algorithm. Instead, this interface just tries to be as flexible and generic as possible. Each algorithm must document what values it expects the model to output.
Special handling is needed for models that include recurrent layers. In that case, the model has its own internal state which the learning algorithm must be able to specify and query. To support this, the Policy must do three things:
The Model must take additional inputs that specify the initial states of all its recurrent layers. These will be appended to the list of arrays specifying the environment state.
The Model must also return the final states of all its recurrent layers as outputs.
The constructor argument rnn_initial_states must be specified to define the states to use for the Model’s recurrent layers at the start of a new rollout.
Policy objects should be written to support pickling. Many algorithms involve creating multiple copies of the Policy, possibly running in different processes or even on different computers.

__init__(output_names, rnn_initial_states=[])
Subclasses should call the superclass constructor in addition to doing their own initialization.
 Parameters
output_names (list of strings) – the names of the Model’s outputs, in order. It is up to each reinforcement learning algorithm to document what outputs it expects policies to compute. Outputs that return the final states of recurrent layers should have the name ‘rnn_state’.
rnn_initial_states (list of NumPy arrays) – the initial states of the Model’s recurrent layers at the start of a new rollout

create_model(**kwargs)
Construct and return a tf.keras.Model that computes the policy.
The inputs to the model consist of the arrays representing the current state of the environment, followed by the initial states for all recurrent layers. Depending on the algorithm being used, other inputs might get passed as well. It is up to each algorithm to document that.
A2C

class A2C(env, policy, max_rollout_length=20, discount_factor=0.99, advantage_lambda=0.98, value_weight=1.0, entropy_weight=0.01, optimizer=None, model_dir=None, use_hindsight=False)
Implements the Advantage Actor-Critic (A2C) algorithm for reinforcement learning.
The algorithm is described in Mnih et al., “Asynchronous Methods for Deep Reinforcement Learning” (https://arxiv.org/abs/1602.01783). This class supports environments with both discrete and continuous action spaces. For discrete action spaces, the “action” argument passed to the environment is an integer giving the index of the action to perform. The policy must output a vector called “action_prob” giving the probability of taking each action. For continuous action spaces, the action is an array where each element is chosen independently from a normal distribution. The policy must output two arrays of the same shape: “action_mean” gives the mean value for each element, and “action_std” gives the standard deviation for each element. In either case, the policy must also output a scalar called “value” which is an estimate of the value function for the current state.
The algorithm optimizes all outputs at once using a loss that is the sum of three terms:
The policy loss, which seeks to maximize the discounted reward for each action.
The value loss, which tries to make the value estimate match the actual discounted reward that was attained at each step.
An entropy term to encourage exploration.
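The three terms listed above can be combined as in the following sketch for discrete action spaces. It is a plain-Python illustration under assumptions, not the library's loss class: the function name a2c_discrete_loss, the argument layout, and the exact reductions (means over the batch, squared error for the value term) are choices made here for clarity.

```python
import math


def a2c_discrete_loss(action_probs, actions, rewards, values, advantages,
                      value_weight=1.0, entropy_weight=0.01):
    """Sum of policy loss, weighted value loss, and an entropy bonus.

    action_probs: one probability vector per step
    actions:      index of the action taken at each step
    rewards:      discounted reward attained from each step
    values:       the policy's value estimate at each step
    advantages:   advantage estimate at each step
    """
    n = len(actions)
    # Policy term: raise the log-probability of actions with positive advantage.
    policy_loss = -sum(adv * math.log(p[a])
                       for p, a, adv in zip(action_probs, actions, advantages)) / n
    # Value term: squared error between discounted reward and value estimate.
    value_loss = sum((r - v) ** 2 for r, v in zip(rewards, values)) / n
    # Entropy of the action distribution; higher entropy means more exploration.
    entropy = -sum(sum(q * math.log(q) for q in p if q > 0.0)
                   for p in action_probs) / n
    # Entropy is subtracted because the loss is minimized.
    return policy_loss + value_weight * value_loss - entropy_weight * entropy


loss = a2c_discrete_loss(
    action_probs=[[0.5, 0.5]], actions=[0],
    rewards=[1.0], values=[1.0], advantages=[1.0])
```

In this sketch value_weight and entropy_weight play the role described for the constructor arguments of the same names: they scale the value and entropy terms relative to the policy term.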
This class supports Generalized Advantage Estimation as described in Schulman et al., “High-Dimensional Continuous Control Using Generalized Advantage Estimation” (https://arxiv.org/abs/1506.02438). This is a method of trading off bias and variance in the advantage estimate, which can sometimes improve the rate of convergence. Use the advantage_lambda parameter to adjust the tradeoff.
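A minimal sketch of the estimator from the cited paper may help show what advantage_lambda controls. The function name generalized_advantages is hypothetical, and the sketch assumes that values holds one more entry than rewards (the value estimate for the state reached after the last step, or 0.0 if the rollout terminated). It is not taken from the library's code.

```python
def generalized_advantages(rewards, values, discount_factor=0.99,
                           advantage_lambda=0.98):
    """Compute A_t = sum_l (gamma * lambda)^l * delta_{t+l}, where
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each step reuses the accumulated tail of the sum.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + discount_factor * values[t + 1] - values[t]
        running = delta + discount_factor * advantage_lambda * running
        advantages[t] = running
    return advantages


adv = generalized_advantages([1.0, 1.0], [0.0, 0.0, 0.0],
                             discount_factor=0.5, advantage_lambda=0.5)
```

With advantage_lambda set to 0 the estimate reduces to the one-step residual delta_t (lower variance, more bias), and with 1 it becomes the full discounted return minus the value estimate (less bias, higher variance).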
This class supports Hindsight Experience Replay as described in Andrychowicz et al., “Hindsight Experience Replay” (https://arxiv.org/abs/1707.01495). This is a method that can enormously accelerate learning when rewards are very rare. It requires that the environment state contains information about the goal the agent is trying to achieve. Each time it generates a rollout, it processes that rollout twice: once using the actual goal the agent was pursuing while generating it, and again using the final state of that rollout as the goal. This guarantees that half of all rollouts processed will be ones that achieved their goals, and hence received a reward.
To use this feature, specify use_hindsight=True to the constructor. The environment must have a method defined as follows:
    def apply_hindsight(self, states, actions, goal):
        ...
        return new_states, rewards
The method receives the list of states generated during the rollout, the action taken for each one, and a new goal state. It should generate a new list of states that are identical to the input ones, except specifying the new goal. It should return that list of states, and the rewards that would have been received for taking the specified actions from those states. The output arrays may be shorter than the input ones, if the modified rollout would have terminated sooner.
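One possible shape for such a method is sketched below for a made-up one-dimensional task. Everything specific here is an assumption for illustration: the GoalLine class, the state layout [position, goal_position], the moves table, and the reward of 1.0 for stepping onto the goal. Plain lists stand in for NumPy arrays.

```python
class GoalLine:
    """Hypothetical environment fragment; only apply_hindsight is shown."""

    moves = (-1, 1)   # action 0 moves left, action 1 moves right

    def apply_hindsight(self, states, actions, goal):
        # The new goal is taken from the position stored in the goal state.
        goal_pos = goal[0]
        new_states = []
        rewards = []
        for state, action in zip(states, actions):
            # Same position as before, but with the goal entry replaced.
            new_states.append([state[0], goal_pos])
            if state[0] + self.moves[action] == goal_pos:
                # The relabeled rollout reaches its goal here, so it ends early.
                rewards.append(1.0)
                break
            rewards.append(0.0)
        return new_states, rewards


states = [[0, 5], [1, 5], [2, 5]]   # the agent was originally pursuing goal 5
actions = [1, 1, 0]                 # right, right, left
final_state = [1, 5]                # position after the last action
new_states, rewards = GoalLine().apply_hindsight(states, actions, final_state)
```

In this example the relabeled rollout is shorter than the original because the agent passes through the new goal on its first step, which is the situation the last sentence of the description allows for.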
Note
Using this class on continuous action spaces requires that tensorflow_probability be installed.

__init__(env, policy, max_rollout_length=20, discount_factor=0.99, advantage_lambda=0.98, value_weight=1.0, entropy_weight=0.01, optimizer=None, model_dir=None, use_hindsight=False)
Create an object for optimizing a policy.
 Parameters
env (Environment) – the Environment to interact with
policy (Policy) – the Policy to optimize. It must have outputs with the names ‘action_prob’ and ‘value’ (for discrete action spaces) or ‘action_mean’, ‘action_std’, and ‘value’ (for continuous action spaces)
max_rollout_length (int) – the maximum length of rollouts to generate
discount_factor (float) – the discount factor to use when computing rewards
advantage_lambda (float) – the parameter for trading bias vs. variance in Generalized Advantage Estimation
value_weight (float) – a scale factor for the value loss term in the loss function
entropy_weight (float) – a scale factor for the entropy term in the loss function
optimizer (Optimizer) – the optimizer to use. If None, a default optimizer is used.
model_dir (str) – the directory in which the model will be saved. If None, a temporary directory will be created.
use_hindsight (bool) – if True, use Hindsight Experience Replay
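To make the role of discount_factor concrete, here is a small sketch of how discounted rewards are commonly computed for a rollout. The helper name discounted_returns is hypothetical and the code is a generic illustration, not the library's implementation.

```python
def discounted_returns(rewards, discount_factor=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... per step."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the end of the rollout back to the start.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount_factor * running
        returns[t] = running
    return returns


returns = discounted_returns([1.0, 1.0, 1.0], discount_factor=0.5)
```

A discount factor close to 1 makes distant rewards count almost as much as immediate ones, while smaller values make the agent more short-sighted.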

fit(total_steps, max_checkpoints_to_keep=5, checkpoint_interval=600, restore=False)
Train the policy.
 Parameters
total_steps (int) – the total number of time steps to perform on the environment, across all rollouts on all threads
max_checkpoints_to_keep (int) – the maximum number of checkpoint files to keep. When this number is reached, older files are deleted.
checkpoint_interval (float) – the time interval at which to save checkpoints, measured in seconds
restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.

predict(state, use_saved_states=True, save_states=True)
Compute the policy’s output predictions for a state.
If the policy involves recurrent layers, this method can preserve their internal states between calls. Use the use_saved_states and save_states arguments to specify how it should behave.
 Parameters
state (array or list of arrays) – the state of the environment for which to generate predictions
use_saved_states (bool) – if True, the states most recently saved by a previous call to predict() or select_action() will be used as the initial states. If False, the internal states of all recurrent layers will be set to the initial values defined by the policy before computing the predictions.
save_states (bool) – if True, the internal states of all recurrent layers at the end of the calculation will be saved, and any previously saved states will be discarded. If False, the states at the end of the calculation will be discarded, and any previously saved states will be kept.
 Returns
the array of action probabilities, and the estimated value function
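The interaction between use_saved_states and save_states can be easier to follow in a toy model. In the sketch below, RecurrentPredictor is hypothetical and its "recurrent layer" is just a running sum. It only mirrors the semantics described for the two flags and says nothing about how the library stores recurrent states internally.

```python
class RecurrentPredictor:
    """Toy stand-in for a model whose output depends on an internal state."""

    def __init__(self, initial_state=0.0):
        self.initial_state = initial_state   # state at the start of a rollout
        self.saved_state = initial_state     # state kept between calls

    def predict(self, x, use_saved_states=True, save_states=True):
        # Start from the saved state, or from the policy's initial state.
        start = self.saved_state if use_saved_states else self.initial_state
        final = start + x                    # toy recurrence: a running sum
        if save_states:
            # Replace whatever was saved before with the new final state.
            self.saved_state = final
        return final


model = RecurrentPredictor()
first = model.predict(1.0)                        # starts from 0.0, saves 1.0
second = model.predict(2.0)                       # starts from 1.0, saves 3.0
peek = model.predict(5.0, save_states=False)      # starts from 3.0, saves nothing
third = model.predict(1.0)                        # still starts from 3.0
fresh = model.predict(1.0, use_saved_states=False)  # starts from 0.0 again
```

Passing save_states=False is useful for "what if" queries in the middle of a rollout, since it leaves the saved state untouched for the next real step.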

select_action(state, deterministic=False, use_saved_states=True, save_states=True)
Select an action to perform based on the environment’s state.
If the policy involves recurrent layers, this method can preserve their internal states between calls. Use the use_saved_states and save_states arguments to specify how it should behave.
 Parameters
state (array or list of arrays) – the state of the environment for which to select an action
deterministic (bool) – if True, always return the best action (that is, the one with highest probability). If False, randomly select an action based on the computed probabilities.
use_saved_states (bool) – if True, the states most recently saved by a previous call to predict() or select_action() will be used as the initial states. If False, the internal states of all recurrent layers will be set to the initial values defined by the policy before computing the predictions.
save_states (bool) – if True, the internal states of all recurrent layers at the end of the calculation will be saved, and any previously saved states will be discarded. If False, the states at the end of the calculation will be discarded, and any previously saved states will be kept.
 Returns
the index of the selected action
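For a discrete action space, the difference between deterministic and random selection can be sketched as follows. The helper choose_action is hypothetical and works directly on a probability vector; it illustrates the described behavior and is not the library's code.

```python
import random


def choose_action(action_prob, deterministic=False, rng=random):
    """Pick an action index from a vector of action probabilities."""
    indices = range(len(action_prob))
    if deterministic:
        # Always take the most probable action.
        return max(indices, key=lambda i: action_prob[i])
    # Otherwise sample an index in proportion to its probability.
    return rng.choices(indices, weights=action_prob, k=1)[0]


best = choose_action([0.1, 0.7, 0.2], deterministic=True)
sampled = choose_action([0.1, 0.7, 0.2], rng=random.Random(0))
```

Random selection is what drives exploration during training, while deterministic selection is a common choice when evaluating a trained policy.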

class A2CLossDiscrete(value_weight, entropy_weight, action_prob_index, value_index)
This class computes the loss function for A2C with discrete action spaces.
PPO

class PPO(env, policy, max_rollout_length=20, optimization_rollouts=8, optimization_epochs=4, batch_size=64, clipping_width=0.2, discount_factor=0.99, advantage_lambda=0.98, value_weight=1.0, entropy_weight=0.01, optimizer=None, model_dir=None, use_hindsight=False)
Implements the Proximal Policy Optimization (PPO) algorithm for reinforcement learning.
The algorithm is described in Schulman et al., “Proximal Policy Optimization Algorithms” (https://openai-public.s3-us-west-2.amazonaws.com/blog/2017-07/ppo/ppo-arxiv.pdf). This class requires the policy to output two quantities: a vector giving the probability of taking each action, and an estimate of the value function for the current state. It optimizes both outputs at once using a loss that is the sum of three terms:
The policy loss, which seeks to maximize the discounted reward for each action.
The value loss, which tries to make the value estimate match the actual discounted reward that was attained at each step.
An entropy term to encourage exploration.
This class only supports environments with discrete action spaces, not continuous ones. The “action” argument passed to the environment is an integer, giving the index of the action to perform.
This class supports Generalized Advantage Estimation as described in Schulman et al., “High-Dimensional Continuous Control Using Generalized Advantage Estimation” (https://arxiv.org/abs/1506.02438). This is a method of trading off bias and variance in the advantage estimate, which can sometimes improve the rate of convergence. Use the advantage_lambda parameter to adjust the tradeoff.
This class supports Hindsight Experience Replay as described in Andrychowicz et al., “Hindsight Experience Replay” (https://arxiv.org/abs/1707.01495). This is a method that can enormously accelerate learning when rewards are very rare. It requires that the environment state contains information about the goal the agent is trying to achieve. Each time it generates a rollout, it processes that rollout twice: once using the actual goal the agent was pursuing while generating it, and again using the final state of that rollout as the goal. This guarantees that half of all rollouts processed will be ones that achieved their goals, and hence received a reward.
To use this feature, specify use_hindsight=True to the constructor. The environment must have a method defined as follows:
    def apply_hindsight(self, states, actions, goal):
        ...
        return new_states, rewards
The method receives the list of states generated during the rollout, the action taken for each one, and a new goal state. It should generate a new list of states that are identical to the input ones, except specifying the new goal. It should return that list of states, and the rewards that would have been received for taking the specified actions from those states. The output arrays may be shorter than the input ones, if the modified rollout would have terminated sooner.

__init__(env, policy, max_rollout_length=20, optimization_rollouts=8, optimization_epochs=4, batch_size=64, clipping_width=0.2, discount_factor=0.99, advantage_lambda=0.98, value_weight=1.0, entropy_weight=0.01, optimizer=None, model_dir=None, use_hindsight=False)
Create an object for optimizing a policy.
 Parameters
env (Environment) – the Environment to interact with
policy (Policy) – the Policy to optimize. It must have outputs with the names ‘action_prob’ and ‘value’, corresponding to the action probabilities and value estimate
max_rollout_length (int) – the maximum length of rollouts to generate
optimization_rollouts (int) – the number of rollouts to generate for each iteration of optimization
optimization_epochs (int) – the number of epochs of optimization to perform within each iteration
batch_size (int) – the batch size to use during optimization. If this is 0, each rollout will be used as a separate batch.
clipping_width (float) – in computing the PPO loss function, the probability ratio is clipped to the range (1-clipping_width, 1+clipping_width)
discount_factor (float) – the discount factor to use when computing rewards
advantage_lambda (float) – the parameter for trading bias vs. variance in Generalized Advantage Estimation
value_weight (float) – a scale factor for the value loss term in the loss function
entropy_weight (float) – a scale factor for the entropy term in the loss function
optimizer (Optimizer) – the optimizer to use. If None, a default optimizer is used.
model_dir (str) – the directory in which the model will be saved. If None, a temporary directory will be created.
use_hindsight (bool) – if True, use Hindsight Experience Replay
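The effect of clipping_width on the policy term can be illustrated with the clipped surrogate objective from the PPO paper. The sketch below is a plain-Python illustration under assumptions: the function name clipped_policy_loss and its argument layout are hypothetical, and it covers only the policy term, not the value or entropy terms.

```python
def clipped_policy_loss(new_probs, old_probs, advantages, clipping_width=0.2):
    """Negative mean of min(ratio * A, clip(ratio) * A) over the batch.

    new_probs: probability of each taken action under the current policy
    old_probs: probability of the same action when the rollout was generated
    """
    total = 0.0
    for new_p, old_p, adv in zip(new_probs, old_probs, advantages):
        ratio = new_p / old_p
        # Keep the ratio within (1 - clipping_width, 1 + clipping_width).
        clipped = min(max(ratio, 1.0 - clipping_width), 1.0 + clipping_width)
        # Taking the minimum removes the incentive to move the ratio
        # far outside the clipping range in the favorable direction.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)


loss_positive = clipped_policy_loss([0.9], [0.5], [1.0])
loss_negative = clipped_policy_loss([0.9], [0.5], [-1.0])
```

In the first call the ratio of 1.8 is clipped to 1.2, so the policy gains nothing from pushing the probability further. In the second call the advantage is negative, the unclipped term is the smaller one, and the full penalty applies.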

fit(total_steps, max_checkpoints_to_keep=5, checkpoint_interval=600, restore=False)
Train the policy.
 Parameters
total_steps (int) – the total number of time steps to perform on the environment, across all rollouts on all threads
max_checkpoints_to_keep (int) – the maximum number of checkpoint files to keep. When this number is reached, older files are deleted.
checkpoint_interval (float) – the time interval at which to save checkpoints, measured in seconds
restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.

predict(state, use_saved_states=True, save_states=True)
Compute the policy’s output predictions for a state.
If the policy involves recurrent layers, this method can preserve their internal states between calls. Use the use_saved_states and save_states arguments to specify how it should behave.
 Parameters
state (array or list of arrays) – the state of the environment for which to generate predictions
use_saved_states (bool) – if True, the states most recently saved by a previous call to predict() or select_action() will be used as the initial states. If False, the internal states of all recurrent layers will be set to the initial values defined by the policy before computing the predictions.
save_states (bool) – if True, the internal states of all recurrent layers at the end of the calculation will be saved, and any previously saved states will be discarded. If False, the states at the end of the calculation will be discarded, and any previously saved states will be kept.
 Returns
the array of action probabilities, and the estimated value function

select_action(state, deterministic=False, use_saved_states=True, save_states=True)
Select an action to perform based on the environment’s state.
If the policy involves recurrent layers, this method can preserve their internal states between calls. Use the use_saved_states and save_states arguments to specify how it should behave.
 Parameters
state (array or list of arrays) – the state of the environment for which to select an action
deterministic (bool) – if True, always return the best action (that is, the one with highest probability). If False, randomly select an action based on the computed probabilities.
use_saved_states (bool) – if True, the states most recently saved by a previous call to predict() or select_action() will be used as the initial states. If False, the internal states of all recurrent layers will be set to the initial values defined by the policy before computing the predictions.
save_states (bool) – if True, the internal states of all recurrent layers at the end of the calculation will be saved, and any previously saved states will be discarded. If False, the states at the end of the calculation will be discarded, and any previously saved states will be kept.
 Returns
the index of the selected action

class PPOLoss(value_weight, entropy_weight, clipping_width, action_prob_index, value_index)
This class computes the loss function for PPO.