Reinforcement Learning

Reinforcement learning is a powerful technique for learning when you have access to a simulator: a high fidelity way of predicting the outcome of an experiment, such as a physics engine, a chemistry engine, or any other model of a system. If you would like to solve some task within that simulator, reinforcement learning is a natural tool for the job.
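
The classes below fit together in a simple way: you wrap your simulator in an Environment, define a Policy whose model maps states to the outputs a learning algorithm needs, and hand both to an algorithm such as A2C or PPO. A minimal sketch, assuming MyEnvironment and MyPolicy are hypothetical subclasses of the classes documented below:

from deepchem.rl.torch_rl import A2C

env = MyEnvironment()      # hypothetical Environment subclass wrapping your simulator
policy = MyPolicy()        # hypothetical Policy whose model outputs 'action_prob' and 'value'
a2c = A2C(env, policy)     # the PyTorch A2C implementation documented below
a2c.fit(100000)            # generate rollouts in the simulator and optimize the policy
env.reset()
action = a2c.select_action(env.state, deterministic=True)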

Environments

class Environment(state_shape, n_actions=None, state_dtype=None, action_shape=None)[source]

An environment in which an actor performs actions to accomplish a task.

An environment has a current state, which is represented as either a single NumPy array, or optionally a list of NumPy arrays. When an action is taken, that causes the state to be updated. The environment also computes a reward for each action, and reports when the task has been terminated (meaning that no more actions may be taken).

Two types of actions are supported. For environments with discrete action spaces, the action is an integer specifying the index of the action to perform (out of a fixed list of possible actions). For environments with continuous action spaces, the action is a NumPy array.

Environment objects should be written to support pickle and deepcopy operations. Many algorithms involve creating multiple copies of the Environment, possibly running in different processes or even on different computers.

__init__(state_shape, n_actions=None, state_dtype=None, action_shape=None)[source]

Subclasses should call the superclass constructor in addition to doing their own initialization.

A value should be provided for either n_actions (for discrete action spaces) or action_shape (for continuous action spaces), but not both; a short example follows the parameter list below.

Parameters:
  • state_shape (tuple or list of tuples) – the shape(s) of the array(s) making up the state

  • n_actions (int) – the number of discrete actions that can be performed. If the action space is continuous, this should be None.

  • state_dtype (dtype or list of dtypes) – the type(s) of the array(s) making up the state. If this is None, all arrays are assumed to be float32.

  • action_shape (tuple) – the shape of the array describing an action. If the action space is discrete, this should be None.
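
For example (a sketch; the shapes, dtypes, and class names are illustrative assumptions, and a real subclass would also implement reset() and step()), a discrete-action environment passes n_actions while a continuous-action environment passes action_shape:

import numpy as np
import deepchem as dc

class DiscreteEnv(dc.rl.Environment):

    def __init__(self):
        # a single float32 state array of shape (4,) and three possible actions
        super(DiscreteEnv, self).__init__(state_shape=(4,), n_actions=3)

class ContinuousEnv(dc.rl.Environment):

    def __init__(self):
        # a state made of two arrays with different dtypes; actions are arrays of shape (2,)
        super(ContinuousEnv, self).__init__(state_shape=[(4,), (10,)],
                                            state_dtype=[np.float32, np.int32],
                                            action_shape=(2,))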

property state[source]

The current state of the environment, represented as either a NumPy array or list of arrays.

If reset() has not yet been called at least once, this is undefined.

property terminated[source]

Whether the task has reached its end.

If reset() has not yet been called at least once, this is undefined.

property state_shape[source]

The shape of the arrays that describe a state.

If the state is a single array, this returns a tuple giving the shape of that array. If the state is a list of arrays, this returns a list of tuples where each tuple is the shape of one array.

property state_dtype[source]

The dtypes of the arrays that describe a state.

If the state is a single array, this returns the dtype of that array. If the state is a list of arrays, this returns a list containing the dtypes of the arrays.

property n_actions[source]

The number of possible actions that can be performed in this Environment.

If the environment uses a continuous action space, this returns None.

property action_shape[source]

The expected shape of NumPy arrays representing actions.

If the environment uses a discrete action space, this returns None.

reset()[source]

Initialize the environment in preparation for doing calculations with it.

This must be called before calling step() or querying the state. You can call it again later to reset the environment back to its original state.

step(action)[source]

Take a time step by performing an action.

This causes the “state” and “terminated” properties to be updated.

Parameters:

action (object) – an object describing the action to take

Returns:

the reward earned by taking the action, represented as a floating point number (higher values are better)
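
For example, a rollout that takes random actions in a discrete-action environment might look like the following sketch (env is assumed to be an Environment with a discrete action space, such as the ones sketched above):

import numpy as np

env.reset()                                    # must be called before step()
total_reward = 0.0
while not env.terminated:
    action = np.random.randint(env.n_actions)  # choose an arbitrary discrete action
    total_reward += env.step(action)           # step() returns the reward for that action
print(total_reward)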

class GymEnvironment(name)[source]

This is a convenience class for working with environments from OpenAI Gym.

__init__(name)[source]

Create an Environment wrapping the OpenAI Gym environment with a specified name.
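
A brief hypothetical usage, assuming OpenAI Gym and its 'CartPole-v1' task are installed:

import deepchem as dc

env = dc.rl.GymEnvironment('CartPole-v1')
env.reset()
print(env.n_actions)     # number of discrete actions reported by the wrapped environment
print(env.state_shape)   # shape of the observation array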

reset()[source]

Initialize the environment in preparation for doing calculations with it.

This must be called before calling step() or querying the state. You can call it again later to reset the environment back to its original state.

step(action)[source]

Take a time step by performing an action.

This causes the “state” and “terminated” properties to be updated.

Parameters:

action (object) – an object describing the action to take

Returns:

the reward earned by taking the action, represented as a floating point number (higher values are better)

Policies

class Policy(output_names, rnn_initial_states=[])[source]

A policy for taking actions within an environment.

A policy is defined by a tf.keras.Model that takes the current state as input and performs the necessary calculations. There are many algorithms for reinforcement learning, and they differ in what values they require a policy to compute. That makes it impossible to define a single interface allowing any policy to be optimized with any algorithm. Instead, this interface just tries to be as flexible and generic as possible. Each algorithm must document what values it expects the model to output.

Special handling is needed for models that include recurrent layers. In that case, the model has its own internal state which the learning algorithm must be able to specify and query. To support this, the Policy must do three things:

  1. The Model must take additional inputs that specify the initial states of all its recurrent layers. These will be appended to the list of arrays specifying the environment state.

  2. The Model must also return the final states of all its recurrent layers as outputs.

  3. The constructor argument rnn_initial_states must be specified to define the states to use for the Model’s recurrent layers at the start of a new rollout.
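
For example, a recurrent policy satisfying these three requirements might look like the following sketch. The state shape of (8,), the four actions, the GRU size, and the shape used for rnn_initial_states are all illustrative assumptions, not requirements of this API:

import numpy as np
import tensorflow as tf
import deepchem as dc

class RecurrentPolicy(dc.rl.Policy):

    def __init__(self):
        # 'rnn_state' names the output that returns the GRU's final state (point 2), and
        # rnn_initial_states defines its value at the start of a new rollout (point 3).
        super(RecurrentPolicy, self).__init__(
            ['action_prob', 'value', 'rnn_state'],
            rnn_initial_states=[np.zeros(16, dtype=np.float32)])

    def create_model(self, **kwargs):
        state = tf.keras.Input(shape=(8,))    # the environment state
        rnn_in = tf.keras.Input(shape=(16,))  # the initial GRU state, appended after the state (point 1)
        seq = tf.keras.layers.Reshape((1, 8))(state)
        gru_out, rnn_out = tf.keras.layers.GRU(16, return_state=True)(seq, initial_state=rnn_in)
        action_prob = tf.keras.layers.Dense(4, activation='softmax')(gru_out)
        value = tf.keras.layers.Dense(1)(gru_out)
        return tf.keras.Model(inputs=[state, rnn_in],
                              outputs=[action_prob, value, rnn_out])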

Policy objects should be written to support pickling. Many algorithms involve creating multiple copies of the Policy, possibly running in different processes or even on different computers.

__init__(output_names, rnn_initial_states=[])[source]

Subclasses should call the superclass constructor in addition to doing their own initialization.

Parameters:
  • output_names (list of strings) – the names of the Model’s outputs, in order. It is up to each reinforcement learning algorithm to document what outputs it expects policies to compute. Outputs that return the final states of recurrent layers should have the name ‘rnn_state’.

  • rnn_initial_states (list of NumPy arrays) – the initial states of the Model’s recurrent layers at the start of a new rollout

create_model(**kwargs)[source]

Construct and return a tf.keras.Model that computes the policy.

The inputs to the model consist of the arrays representing the current state of the environment, followed by the initial states for all recurrent layers. Depending on the algorithm being used, other inputs might get passed as well. It is up to each algorithm to document that.

TensorFlow implementation

A2C

class A2C(env, policy, max_rollout_length=20, discount_factor=0.99, advantage_lambda=0.98, value_weight=1.0, entropy_weight=0.01, optimizer=None, model_dir=None, use_hindsight=False)[source]

Implements the Advantage Actor-Critic (A2C) algorithm for reinforcement learning.

The algorithm is described in Mnih et al, “Asynchronous Methods for Deep Reinforcement Learning” (https://arxiv.org/abs/1602.01783). This class supports environments with both discrete and continuous action spaces. For discrete action spaces, the “action” argument passed to the environment is an integer giving the index of the action to perform. The policy must output a vector called “action_prob” giving the probability of taking each action. For continuous action spaces, the action is an array where each element is chosen independently from a normal distribution. The policy must output two arrays of the same shape: “action_mean” gives the mean value for each element, and “action_std” gives the standard deviation for each element. In either case, the policy must also output a scalar called “value” which is an estimate of the value function for the current state.

The algorithm optimizes all outputs at once using a loss that is the sum of three terms:

  1. The policy loss, which seeks to maximize the discounted reward for each action.

  2. The value loss, which tries to make the value estimate match the actual discounted reward that was attained at each step.

  3. An entropy term to encourage exploration.

This class supports Generalized Advantage Estimation as described in Schulman et al., “High-Dimensional Continuous Control Using Generalized Advantage Estimation” (https://arxiv.org/abs/1506.02438). This is a method of trading off bias and variance in the advantage estimate, which can sometimes improve the rate of convergence. Use the advantage_lambda parameter to adjust the tradeoff.
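
The sketch below (illustrative only, not the library's internal code) shows how advantage_lambda enters the estimate: each advantage is a discounted sum of one-step temporal-difference errors, with the product of discount_factor and advantage_lambda controlling how far into the future the sum looks.

import numpy as np

def gae_advantages(rewards, values, discount_factor=0.99, advantage_lambda=0.98):
    # values holds one more entry than rewards: the estimate for the state after the final step
    deltas = rewards + discount_factor * values[1:] - values[:-1]   # one-step TD errors
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + discount_factor * advantage_lambda * running
        advantages[t] = running
    return advantages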

This class supports Hindsight Experience Replay as described in Andrychowicz et al., “Hindsight Experience Replay” (https://arxiv.org/abs/1707.01495). This is a method that can enormously accelerate learning when rewards are very rare. It requires that the environment state contains information about the goal the agent is trying to achieve. Each time it generates a rollout, it processes that rollout twice: once using the actual goal the agent was pursuing while generating it, and again using the final state of that rollout as the goal. This guarantees that half of all rollouts processed will be ones that achieved their goals, and hence received a reward.

To use this feature, specify use_hindsight=True to the constructor. The environment must have a method defined as follows:

def apply_hindsight(self, states, actions, goal):
    ...
    return new_states, rewards

The method receives the list of states generated during the rollout, the action taken for each one, and a new goal state. It should generate a new list of states that are identical to the input ones, except specifying the new goal. It should return that list of states, and the rewards that would have been received for taking the specified actions from those states. The output arrays may be shorter than the input ones, if the modified rollout would have terminated sooner.
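
As a purely hypothetical illustration, an environment whose state is the pair [position, goal] might implement this roughly as follows; the reward scheme and termination rule below are assumptions about such an environment, not something this API prescribes:

import numpy as np

def apply_hindsight(self, states, actions, goal):
    new_goal = goal[0]                           # take the position component of the substituted state as the goal
    new_states = []
    rewards = []
    for state, action in zip(states, actions):   # actions are unused in this simple reward scheme
        position = state[0]
        new_states.append([position, new_goal])
        reached = np.array_equal(position, new_goal)
        rewards.append(1.0 if reached else 0.0)
        if reached:                              # the relabeled rollout would have terminated here
            break
    return new_states, rewards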

Note

Using this class on continuous action spaces requires that tensorflow_probability be installed.

__init__(env, policy, max_rollout_length=20, discount_factor=0.99, advantage_lambda=0.98, value_weight=1.0, entropy_weight=0.01, optimizer=None, model_dir=None, use_hindsight=False)[source]

Create an object for optimizing a policy.

Parameters:
  • env (Environment) – the Environment to interact with

  • policy (Policy) – the Policy to optimize. It must have outputs with the names ‘action_prob’ and ‘value’ (for discrete action spaces) or ‘action_mean’, ‘action_std’, and ‘value’ (for continuous action spaces)

  • max_rollout_length (int) – the maximum length of rollouts to generate

  • discount_factor (float) – the discount factor to use when computing rewards

  • advantage_lambda (float) – the parameter for trading bias vs. variance in Generalized Advantage Estimation

  • value_weight (float) – a scale factor for the value loss term in the loss function

  • entropy_weight (float) – a scale factor for the entropy term in the loss function

  • optimizer (Optimizer) – the optimizer to use. If None, a default optimizer is used.

  • model_dir (str) – the directory in which the model will be saved. If None, a temporary directory will be created.

  • use_hindsight (bool) – if True, use Hindsight Experience Replay

fit(total_steps, max_checkpoints_to_keep=5, checkpoint_interval=600, restore=False)[source]

Train the policy.

Parameters:
  • total_steps (int) – the total number of time steps to perform on the environment, across all rollouts on all threads

  • max_checkpoints_to_keep (int) – the maximum number of checkpoint files to keep. When this number is reached, older files are deleted.

  • checkpoint_interval (float) – the time interval at which to save checkpoints, measured in seconds

  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.

predict(state, use_saved_states=True, save_states=True)[source]

Compute the policy’s output predictions for a state.

If the policy involves recurrent layers, this method can preserve their internal states between calls. Use the use_saved_states and save_states arguments to specify how it should behave.

Parameters:
  • state (array or list of arrays) – the state of the environment for which to generate predictions

  • use_saved_states (bool) – if True, the states most recently saved by a previous call to predict() or select_action() will be used as the initial states. If False, the internal states of all recurrent layers will be set to the initial values defined by the policy before computing the predictions.

  • save_states (bool) – if True, the internal states of all recurrent layers at the end of the calculation will be saved, and any previously saved states will be discarded. If False, the states at the end of the calculation will be discarded, and any previously saved states will be kept.

Return type:

the array of action probabilities, and the estimated value function

select_action(state, deterministic=False, use_saved_states=True, save_states=True)[source]

Select an action to perform based on the environment’s state.

If the policy involves recurrent layers, this method can preserve their internal states between calls. Use the use_saved_states and save_states arguments to specify how it should behave.

Parameters:
  • state (array or list of arrays) – the state of the environment for which to select an action

  • deterministic (bool) – if True, always return the best action (that is, the one with highest probability). If False, randomly select an action based on the computed probabilities.

  • use_saved_states (bool) – if True, the states most recently saved by a previous call to predict() or select_action() will be used as the initial states. If False, the internal states of all recurrent layers will be set to the initial values defined by the policy before computing the predictions.

  • save_states (bool) – if True, the internal states of all recurrent layers at the end of the calculation will be saved, and any previously saved states will be discarded. If False, the states at the end of the calculation will be discarded, and any previously saved states will be kept.

Return type:

the index of the selected action
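
For example, a trained A2C object can drive its environment directly; a sketch, assuming a2c and env are the objects already constructed:

env.reset()
total_reward = 0.0
first_step = True
while not env.terminated:
    # Ignore any previously saved recurrent states on the first step so the rollout
    # starts from the policy's defined initial states.
    action = a2c.select_action(env.state, deterministic=True,
                               use_saved_states=not first_step)
    total_reward += env.step(action)
    first_step = False
print(total_reward)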

restore()[source]

Reload the model parameters from the most recent checkpoint file.

class A2CLossDiscrete(value_weight, entropy_weight, action_prob_index, value_index)[source]

This class computes the loss function for A2C with discrete action spaces.

__init__(value_weight, entropy_weight, action_prob_index, value_index)[source]

PPO

class PPO(env, policy, max_rollout_length=20, optimization_rollouts=8, optimization_epochs=4, batch_size=64, clipping_width=0.2, discount_factor=0.99, advantage_lambda=0.98, value_weight=1.0, entropy_weight=0.01, optimizer=None, model_dir=None, use_hindsight=False)[source]

Implements the Proximal Policy Optimization (PPO) algorithm for reinforcement learning.

The algorithm is described in Schulman et al, “Proximal Policy Optimization Algorithms” (https://openai-public.s3-us-west-2.amazonaws.com/blog/2017-07/ppo/ppo-arxiv.pdf). This class requires the policy to output two quantities: a vector giving the probability of taking each action, and an estimate of the value function for the current state. It optimizes both outputs at once using a loss that is the sum of three terms:

  1. The policy loss, which seeks to maximize the discounted reward for each action.

  2. The value loss, which tries to make the value estimate match the actual discounted reward that was attained at each step.

  3. An entropy term to encourage exploration.

This class only supports environments with discrete action spaces, not continuous ones. The “action” argument passed to the environment is an integer, giving the index of the action to perform.

This class supports Generalized Advantage Estimation as described in Schulman et al., “High-Dimensional Continuous Control Using Generalized Advantage Estimation” (https://arxiv.org/abs/1506.02438). This is a method of trading off bias and variance in the advantage estimate, which can sometimes improve the rate of convergence. Use the advantage_lambda parameter to adjust the tradeoff.

This class supports Hindsight Experience Replay as described in Andrychowicz et al., “Hindsight Experience Replay” (https://arxiv.org/abs/1707.01495). This is a method that can enormously accelerate learning when rewards are very rare. It requires that the environment state contains information about the goal the agent is trying to achieve. Each time it generates a rollout, it processes that rollout twice: once using the actual goal the agent was pursuing while generating it, and again using the final state of that rollout as the goal. This guarantees that half of all rollouts processed will be ones that achieved their goals, and hence received a reward.

To use this feature, specify use_hindsight=True to the constructor. The environment must have a method defined as follows:

def apply_hindsight(self, states, actions, goal):
    ...
    return new_states, rewards

The method receives the list of states generated during the rollout, the action taken for each one, and a new goal state. It should generate a new list of states that are identical to the input ones, except specifying the new goal. It should return that list of states, and the rewards that would have been received for taking the specified actions from those states. The output arrays may be shorter than the input ones, if the modified rollout would have terminated sooner.

__init__(env, policy, max_rollout_length=20, optimization_rollouts=8, optimization_epochs=4, batch_size=64, clipping_width=0.2, discount_factor=0.99, advantage_lambda=0.98, value_weight=1.0, entropy_weight=0.01, optimizer=None, model_dir=None, use_hindsight=False)[source]

Create an object for optimizing a policy.

Parameters:
  • env (Environment) – the Environment to interact with

  • policy (Policy) – the Policy to optimize. It must have outputs with the names ‘action_prob’ and ‘value’, corresponding to the action probabilities and value estimate

  • max_rollout_length (int) – the maximum length of rollouts to generate

  • optimization_rollouts (int) – the number of rollouts to generate for each iteration of optimization

  • optimization_epochs (int) – the number of epochs of optimization to perform within each iteration

  • batch_size (int) – the batch size to use during optimization. If this is 0, each rollout will be used as a separate batch.

  • clipping_width (float) – in computing the PPO loss function, the probability ratio is clipped to the range (1-clipping_width, 1+clipping_width); a sketch of how this clipping enters the loss follows the parameter list

  • discount_factor (float) – the discount factor to use when computing rewards

  • advantage_lambda (float) – the parameter for trading bias vs. variance in Generalized Advantage Estimation

  • value_weight (float) – a scale factor for the value loss term in the loss function

  • entropy_weight (float) – a scale factor for the entropy term in the loss function

  • optimizer (Optimizer) – the optimizer to use. If None, a default optimizer is used.

  • model_dir (str) – the directory in which the model will be saved. If None, a temporary directory will be created.

  • use_hindsight (bool) – if True, use Hindsight Experience Replay
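
The role of clipping_width can be seen in the following sketch of the clipped policy objective (illustrative only, not the library's internal implementation):

import numpy as np

def clipped_policy_objective(action_prob, old_action_prob, advantage, clipping_width=0.2):
    ratio = action_prob / old_action_prob                              # probability ratio for the actions taken
    clipped = np.clip(ratio, 1 - clipping_width, 1 + clipping_width)   # the clipped ratio
    # PPO maximizes the element-wise minimum of the unclipped and clipped terms.
    return np.mean(np.minimum(ratio * advantage, clipped * advantage))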

fit(total_steps, max_checkpoints_to_keep=5, checkpoint_interval=600, restore=False)[source]

Train the policy.

Parameters:
  • total_steps (int) – the total number of time steps to perform on the environment, across all rollouts on all threads

  • max_checkpoints_to_keep (int) – the maximum number of checkpoint files to keep. When this number is reached, older files are deleted.

  • checkpoint_interval (float) – the time interval at which to save checkpoints, measured in seconds

  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.

predict(state, use_saved_states=True, save_states=True)[source]

Compute the policy’s output predictions for a state.

If the policy involves recurrent layers, this method can preserve their internal states between calls. Use the use_saved_states and save_states arguments to specify how it should behave.

Parameters:
  • state (array or list of arrays) – the state of the environment for which to generate predictions

  • use_saved_states (bool) – if True, the states most recently saved by a previous call to predict() or select_action() will be used as the initial states. If False, the internal states of all recurrent layers will be set to the initial values defined by the policy before computing the predictions.

  • save_states (bool) – if True, the internal states of all recurrent layers at the end of the calculation will be saved, and any previously saved states will be discarded. If False, the states at the end of the calculation will be discarded, and any previously saved states will be kept.

Return type:

the array of action probabilities, and the estimated value function

select_action(state, deterministic=False, use_saved_states=True, save_states=True)[source]

Select an action to perform based on the environment’s state.

If the policy involves recurrent layers, this method can preserve their internal states between calls. Use the use_saved_states and save_states arguments to specify how it should behave.

Parameters:
  • state (array or list of arrays) – the state of the environment for which to select an action

  • deterministic (bool) – if True, always return the best action (that is, the one with highest probability). If False, randomly select an action based on the computed probabilities.

  • use_saved_states (bool) – if True, the states most recently saved by a previous call to predict() or select_action() will be used as the initial states. If False, the internal states of all recurrent layers will be set to the initial values defined by the policy before computing the predictions.

  • save_states (bool) – if True, the internal states of all recurrent layers at the end of the calculation will be saved, and any previously saved states will be discarded. If False, the states at the end of the calculation will be discarded, and any previously saved states will be kept.

Return type:

the index of the selected action

restore()[source]

Reload the model parameters from the most recent checkpoint file.

class PPOLoss(value_weight, entropy_weight, clipping_width, action_prob_index, value_index)[source]

This class computes the loss function for PPO.

__init__(value_weight, entropy_weight, clipping_width, action_prob_index, value_index)[source]

Torch implementation

class A2CLossDiscrete(value_weight: float, entropy_weight: float, action_prob_index: int, value_index: int)[source]

This class computes the loss function for A2C with discrete action spaces. The A2C algorithm optimizes all outputs at once using a loss that is the sum of three terms:

  1. The policy loss, which seeks to maximize the discounted reward for each action.

  2. The value loss, which tries to make the value estimate match the actual discounted reward that was attained at each step.

  3. An entropy term to encourage exploration.

Example

>>> import deepchem as dc
>>> import numpy as np
>>> import torch
>>> import torch.nn.functional as F
>>> from deepchem.rl.torch_rl import A2CLossDiscrete
>>> outputs = [torch.tensor([[0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02,0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]]), torch.tensor([0.], requires_grad = True)]
>>> labels = np.array([[0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], dtype = np.float32)
>>> discount = np.array([-1.0203744, -0.02058018, 0.98931295, 2.009407, 1.019603, 0.01980097, -0.9901, 0.01, -1. , 0. ], dtype=np.float32)
>>> advantage = np.array([-1.0203744 ,-0.02058018, 0.98931295, 2.009407, 1.019603, 0.01980097, -0.9901 ,0.01 ,-1. , 0.], dtype = np.float32)
>>> loss = A2CLossDiscrete(value_weight = 1.0, entropy_weight = 0.01, action_prob_index = 0, value_index = 1)
>>> loss_val = loss(outputs, [labels], [discount, advantage])
>>> loss_val
tensor(1.2541, grad_fn=<SubBackward0>)
__init__(value_weight: float, entropy_weight: float, action_prob_index: int, value_index: int)[source]

Computes the loss function for the A2C algorithm with discrete action spaces.

Parameters:
  • value_weight (float) – a scale factor for the value loss term in the loss function

  • entropy_weight (float) – a scale factor for the entropy term in the loss function

  • action_prob_index (int) – Index of the action probabilities in the model’s outputs.

  • value_index (int) – Index of the value estimate in the model’s outputs.

class A2CLossContinuous(value_weight: float, entropy_weight: float, mean_index: int, std_index: int, value_index: int)[source]

This class computes the loss function for A2C with continuous action spaces.

Example

>>> import deepchem as dc
>>> import numpy as np
>>> import torch
>>> import torch.nn.functional as F
>>> from deepchem.rl.torch_rl import A2CLossContinuous
>>> outputs = [torch.tensor([[0.], [0.], [0.], [0.], [0.], [0.], [0.], [0.], [0.], [0.]], dtype=torch.float32, requires_grad=True), torch.tensor([10.], dtype=torch.float32, requires_grad=True), torch.tensor([[27.717865],[28.860144]], dtype=torch.float32, requires_grad=True)]
>>> labels = np.array([[-4.897339 ], [ 3.4308329], [-4.527725 ], [-7.3000813], [-1.9869075], [20.664988 ], [-8.448957 ], [10.580486 ], [10.017258 ], [17.884674 ]], dtype=np.float32)
>>> discount = np.array([4.897339, -8.328172, 7.958559, 2.772356, -5.313174, -22.651896, 29.113945, -19.029444, 0.56322646, -7.867417], dtype=np.float32)
>>> advantage = np.array([-5.681633, -20.57494, -1.4520378, -9.348538, -18.378199, -33.907513, 25.572464, -32.485718 , -6.412546, -15.034998], dtype=np.float32)
>>> loss = A2CLossContinuous(value_weight = 1.0, entropy_weight = 0.01, mean_index = 0, std_index = 1, value_index = 2)
>>> loss_val = loss(outputs, [labels], [discount, advantage])
>>> loss_val
tensor(1050.2310, grad_fn=<SubBackward0>)
__init__(value_weight: float, entropy_weight: float, mean_index: int, std_index: int, value_index: int)[source]

Computes the loss function for the A2C algorithm with continuous action spaces.

Parameters:
  • value_weight (float) – a scale factor for the value loss term in the loss function

  • entropy_weight (float) – a scale factor for the entropy term in the loss function

  • mean_index (int) – Index of the mean of the action distribution in the model’s outputs.

  • std_index (int) – Index of the standard deviation of the action distribution in the model’s outputs.

  • value_index (int) – Index of the value estimate in the model’s outputs.

A2C

class A2C(env: Environment, policy: Policy, max_rollout_length: int = 20, discount_factor: float = 0.99, advantage_lambda: float = 0.98, value_weight: float = 1.0, entropy_weight: float = 0.01, optimizer: Optimizer | None = None, model_dir: str | None = None, use_hindsight: bool = False, device: device | None = None)[source]

Implements the Advantage Actor-Critic (A2C) algorithm for reinforcement learning.

The algorithm is described in Mnih et al, “Asynchronous Methods for Deep Reinforcement Learning” (https://arxiv.org/abs/1602.01783). This class supports environments with both discrete and continuous action spaces. For discrete action spaces, the “action” argument passed to the environment is an integer giving the index of the action to perform. The policy must output a vector called “action_prob” giving the probability of taking each action. For continuous action spaces, the action is an array where each element is chosen independently from a normal distribution. The policy must output two arrays of the same shape: “action_mean” gives the mean value for each element, and “action_std” gives the standard deviation for each element. In either case, the policy must also output a scalar called “value” which is an estimate of the value function for the current state.

The algorithm optimizes all outputs at once using a loss that is the sum of three terms:

  1. The policy loss, which seeks to maximize the discounted reward for each action.

  2. The value loss, which tries to make the value estimate match the actual discounted reward that was attained at each step.

  3. An entropy term to encourage exploration.

This class supports Generalized Advantage Estimation as described in Schulman et al., “High-Dimensional Continuous Control Using Generalized Advantage Estimation” (https://arxiv.org/abs/1506.02438). This is a method of trading off bias and variance in the advantage estimate, which can sometimes improve the rate of convergence. Use the advantage_lambda parameter to adjust the tradeoff.

This class supports Hindsight Experience Replay as described in Andrychowicz et al., “Hindsight Experience Replay” (https://arxiv.org/abs/1707.01495). This is a method that can enormously accelerate learning when rewards are very rare. It requires that the environment state contains information about the goal the agent is trying to achieve. Each time it generates a rollout, it processes that rollout twice: once using the actual goal the agent was pursuing while generating it, and again using the final state of that rollout as the goal. This guarantees that half of all rollouts processed will be ones that achieved their goals, and hence received a reward.

To use this feature, specify use_hindsight=True to the constructor. The environment must have a method defined as follows:

def apply_hindsight(self, states, actions, goal):
    ...
    return new_states, rewards

The method receives the list of states generated during the rollout, the action taken for each one, and a new goal state. It should generate a new list of states that are identical to the input ones, except specifying the new goal. It should return that list of states, and the rewards that would have been received for taking the specified actions from those states. The output arrays may be shorter than the input ones, if the modified rollout would have terminated sooner.

Example

>>> import deepchem as dc
>>> import numpy as np
>>> import torch
>>> import torch.nn.functional as F
>>> import torch.nn as nn
>>> from deepchem.rl.torch_rl import A2C
>>> from deepchem.models.optimizers import Adam
>>> class RouletteEnvironment(dc.rl.Environment):
...     def __init__(self):
...         super(RouletteEnvironment, self).__init__([(1,)], 38)
...         self._state = [np.array([0])]
...     def step(self, action):
...         if action == 37:
...             self._terminated = True  # Walk away.
...             return 0.0
...         wheel = np.random.randint(37)
...         if wheel == 0:
...             if action == 0:
...                 return 35.0
...             return -1.0
...         if action != 0 and wheel % 2 == action % 2:
...             return 1.0
...         return -1.0
...     def reset(self):
...         self._terminated = False
>>> class TestPolicy(dc.rl.Policy):
...     def __init__(self, env):
...         super(TestPolicy, self).__init__(['action_prob', 'value'])
...         self.env = env
...     def create_model(self, **kwargs):
...         env = self.env
...         class TestModel(nn.Module):
...             def __init__(self):
...                 super(TestModel, self).__init__()
...                 self.action = nn.Parameter(torch.ones(env.n_actions, dtype=torch.float32))
...                 self.value = nn.Parameter(torch.tensor([0.0], dtype=torch.float32))
...             def forward(self, inputs):
...                 prob = F.softmax(torch.reshape(self.action, (-1, env.n_actions)))
...                 return prob, self.value
...         return TestModel()
>>> env = RouletteEnvironment()
>>> policy = TestPolicy(env)
>>> a2c = A2C(env, policy, max_rollout_length=20, optimizer=Adam(learning_rate=0.001))
>>> a2c.fit(1000)
>>> action_prob, value = a2c.predict([[0]])
__init__(env: Environment, policy: Policy, max_rollout_length: int = 20, discount_factor: float = 0.99, advantage_lambda: float = 0.98, value_weight: float = 1.0, entropy_weight: float = 0.01, optimizer: Optimizer | None = None, model_dir: str | None = None, use_hindsight: bool = False, device: device | None = None) None[source]

Create an object for optimizing a policy.

Parameters:
  • env (Environment) – the Environment to interact with

  • policy (Policy) – the Policy to optimize. It must have outputs with the names ‘action_prob’ and ‘value’ (for discrete action spaces) or ‘action_mean’, ‘action_std’, and ‘value’ (for continuous action spaces)

  • max_rollout_length (int) – the maximum length of rollouts to generate

  • discount_factor (float) – the discount factor to use when computing rewards

  • advantage_lambda (float) – the parameter for trading bias vs. variance in Generalized Advantage Estimation

  • value_weight (float) – a scale factor for the value loss term in the loss function

  • entropy_weight (float) – a scale factor for the entropy term in the loss function

  • optimizer (Optimizer) – the optimizer to use. If None, a default optimizer is used.

  • model_dir (str) – the directory in which the model will be saved. If None, a temporary directory will be created.

  • use_hindsight (bool) – if True, use Hindsight Experience Replay

  • device (torch.device, optional (default None)) – the device on which to run computations. If None, a device is chosen automatically.

fit(total_steps: int, max_checkpoints_to_keep: int = 5, checkpoint_interval: int = 600, restore: bool = False) None[source]

Train the policy.

Parameters:
  • total_steps (int) – the total number of time steps to perform on the environment, across all rollouts on all threads

  • max_checkpoints_to_keep (int) – the maximum number of checkpoint files to keep. When this number is reached, older files are deleted.

  • checkpoint_interval (float) – the time interval at which to save checkpoints, measured in seconds

  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.

predict(state: ndarray, use_saved_states: bool = True, save_states: bool = True) List[ndarray][source]

Compute the policy’s output predictions for a state. If the policy involves recurrent layers, this method can preserve their internal states between calls. Use the use_saved_states and save_states arguments to specify how it should behave.

Parameters:
  • state (array or list of arrays) – the state of the environment for which to generate predictions

  • use_saved_states (bool) – if True, the states most recently saved by a previous call to predict() or select_action() will be used as the initial states. If False, the internal states of all recurrent layers will be set to the initial values defined by the policy before computing the predictions.

  • save_states (bool) – if True, the internal states of all recurrent layers at the end of the calculation will be saved, and any previously saved states will be discarded. If False, the states at the end of the calculation will be discarded, and any previously saved states will be kept.

Return type:

the array of action probabilities, and the estimated value function

select_action(state: List[ndarray], deterministic: bool = False, use_saved_states: bool = True, save_states: bool = True) int[source]

Select an action to perform based on the environment’s state. If the policy involves recurrent layers, this method can preserve their internal states between calls. Use the use_saved_states and save_states arguments to specify how it should behave.

Parameters:
  • state (array or list of arrays) – the state of the environment for which to select an action

  • deterministic (bool) – if True, always return the best action (that is, the one with highest probability). If False, randomly select an action based on the computed probabilities.

  • use_saved_states (bool) – if True, the states most recently saved by a previous call to predict() or select_action() will be used as the initial states. If False, the internal states of all recurrent layers will be set to the initial values defined by the policy before computing the predictions.

  • save_states (bool) – if True, the internal states of all recurrent layers at the end of the calculation will be saved, and any previously saved states will be discarded. If False, the states at the end of the calculation will be discarded, and any previously saved states will be kept.

Return type:

the index of the selected action

restore(strict: bool | None = True) None[source]

Reload the model parameters from the most recent checkpoint file.

save_checkpoint(max_checkpoints_to_keep: int = 5, model_dir: str | None = None) None[source]

Save a checkpoint to disk. Usually you do not need to call this method, since fit() saves checkpoints automatically. If you have disabled automatic checkpointing during fitting, this can be called to manually write checkpoints.

Parameters:
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.

  • model_dir (str, default None) – Model directory to save checkpoint to. If None, revert to self.model_dir

get_checkpoints(model_dir: str | None = None) List[str][source]

Get a list of all available checkpoint files.

Parameters:

model_dir (str, default None) – Directory to get list of checkpoints from. Reverts to self.model_dir if None
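
A brief sketch of manual checkpoint handling, assuming a2c is the object trained in the example above:

a2c.save_checkpoint(max_checkpoints_to_keep=3)   # write a checkpoint without waiting for fit()
print(a2c.get_checkpoints())                     # paths of the available checkpoint files
a2c.restore()                                    # reload parameters from the most recent checkpoint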

PPO

class PPO(env: Environment, policy: Policy, max_rollout_length: int = 20, optimization_rollouts: int = 8, optimization_epochs: int = 4, batch_size: int = 64, clipping_width: float = 0.2, discount_factor: float = 0.99, advantage_lambda: float = 0.98, value_weight: float = 1.0, entropy_weight: float = 0.01, optimizer: Optimizer | None = None, model_dir: str | None = None, use_hindsight: bool = False, device: device | None = None)[source]

Implements the Proximal Policy Optimization (PPO) algorithm for reinforcement learning.

The algorithm is described in Schulman et al, “Proximal Policy Optimization Algorithms” (https://openai-public.s3-us-west-2.amazonaws.com/blog/2017-07/ppo/ppo-arxiv.pdf). This class requires the policy to output two quantities: a vector giving the probability of taking each action, and an estimate of the value function for the current state. It optimizes both outputs at once using a loss that is the sum of three terms:

  1. The policy loss, which seeks to maximize the discounted reward for each action.

  2. The value loss, which tries to make the value estimate match the actual discounted reward that was attained at each step.

  3. An entropy term to encourage exploration.

This class only supports environments with discrete action spaces, not continuous ones. The “action” argument passed to the environment is an integer, giving the index of the action to perform.

This class supports Generalized Advantage Estimation as described in Schulman et al., “High-Dimensional Continuous Control Using Generalized Advantage Estimation” (https://arxiv.org/abs/1506.02438). This is a method of trading off bias and variance in the advantage estimate, which can sometimes improve the rate of convergence. Use the advantage_lambda parameter to adjust the tradeoff.

This class supports Hindsight Experience Replay as described in Andrychowicz et al., “Hindsight Experience Replay” (https://arxiv.org/abs/1707.01495). This is a method that can enormously accelerate learning when rewards are very rare. It requires that the environment state contains information about the goal the agent is trying to achieve. Each time it generates a rollout, it processes that rollout twice: once using the actual goal the agent was pursuing while generating it, and again using the final state of that rollout as the goal. This guarantees that half of all rollouts processed will be ones that achieved their goals, and hence received a reward.

To use this feature, specify use_hindsight=True to the constructor. The environment must have a method defined as follows:

def apply_hindsight(self, states, actions, goal):
    ...
    return new_states, rewards

The method receives the list of states generated during the rollout, the action taken for each one, and a new goal state. It should generate a new list of states that are identical to the input ones, except specifying the new goal. It should return that list of states, and the rewards that would have been received for taking the specified actions from those states. The output arrays may be shorter than the input ones, if the modified rollout would have terminated sooner.

Example

>>> import deepchem as dc
>>> import numpy as np
>>> import torch
>>> import torch.nn.functional as F
>>> import torch.nn as nn
>>> from deepchem.rl.torch_rl import PPO
>>> from deepchem.models.optimizers import Adam
>>> class RouletteEnvironment(dc.rl.Environment):
...     def __init__(self):
...         super(RouletteEnvironment, self).__init__([(1,)], 38)
...         self._state = [np.array([0])]
...     def step(self, action):
...         if action == 37:
...             self._terminated = True  # Walk away.
...             return 0.0
...         wheel = np.random.randint(37)
...         if wheel == 0:
...             if action == 0:
...                 return 35.0
...             return -1.0
...         if action != 0 and wheel % 2 == action % 2:
...             return 1.0
...         return -1.0
...     def reset(self):
...         self._terminated = False
>>> class TestPolicy(dc.rl.Policy):
...     def __init__(self):
...         super(TestPolicy, self).__init__(['action_prob', 'value'])
...     def create_model(self, **kwargs):
...         class TestModel(nn.Module):
...             def __init__(self):
...                 super(TestModel, self).__init__()
...                 self.action = nn.Parameter(torch.ones(env.n_actions, dtype=torch.float32))
...                 self.value = nn.Parameter(torch.tensor([0.0], dtype=torch.float32))
...             def forward(self, inputs):
...                 prob = F.softmax(torch.reshape(self.action, (-1, env.n_actions)))
...                 return prob, self.value
...         return TestModel()
>>> env = RouletteEnvironment()
>>> ppo = PPO(env, TestPolicy(), max_rollout_length=20, optimization_epochs=8, optimizer=Adam(learning_rate=0.003))
>>> ppo.fit(100000)
>>> action_prob, value = ppo.predict([[0]])
__init__(env: Environment, policy: Policy, max_rollout_length: int = 20, optimization_rollouts: int = 8, optimization_epochs: int = 4, batch_size: int = 64, clipping_width: float = 0.2, discount_factor: float = 0.99, advantage_lambda: float = 0.98, value_weight: float = 1.0, entropy_weight: float = 0.01, optimizer: Optimizer | None = None, model_dir: str | None = None, use_hindsight: bool = False, device: device | None = None) None[source]

Create an object for optimizing a policy.

Parameters:
  • env (Environment) – the Environment to interact with

  • policy (Policy) – the Policy to optimize. It must have outputs with the names ‘action_prob’ and ‘value’, corresponding to the action probabilities and value estimate

  • max_rollout_length (int) – the maximum length of rollouts to generate

  • optimization_rollouts (int) – the number of rollouts to generate for each iteration of optimization

  • optimization_epochs (int) – the number of epochs of optimization to perform within each iteration

  • batch_size (int) – the batch size to use during optimization. If this is 0, each rollout will be used as a separate batch.

  • clipping_width (float) – in computing the PPO loss function, the probability ratio is clipped to the range (1-clipping_width, 1+clipping_width)

  • discount_factor (float) – the discount factor to use when computing rewards

  • advantage_lambda (float) – the parameter for trading bias vs. variance in Generalized Advantage Estimation

  • value_weight (float) – a scale factor for the value loss term in the loss function

  • entropy_weight (float) – a scale factor for the entropy term in the loss function

  • optimizer (Optimizer) – the optimizer to use. If None, a default optimizer is used.

  • model_dir (str) – the directory in which the model will be saved. If None, a temporary directory will be created.

  • use_hindsight (bool) – if True, use Hindsight Experience Replay

fit(total_steps: int, max_checkpoints_to_keep: int = 5, checkpoint_interval: int = 600, restore: bool = False) None[source]

Train the policy.

Parameters:
  • total_steps (int) – the total number of time steps to perform on the environment, across all rollouts on all threads

  • max_checkpoints_to_keep (int) – the maximum number of checkpoint files to keep. When this number is reached, older files are deleted.

  • checkpoint_interval (float) – the time interval at which to save checkpoints, measured in seconds

  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.

predict(state: ndarray, use_saved_states: bool = True, save_states: bool = True) List[ndarray][source]

Compute the policy’s output predictions for a state.

If the policy involves recurrent layers, this method can preserve their internal states between calls. Use the use_saved_states and save_states arguments to specify how it should behave.

Parameters:
  • state (array or list of arrays) – the state of the environment for which to generate predictions

  • use_saved_states (bool) – if True, the states most recently saved by a previous call to predict() or select_action() will be used as the initial states. If False, the internal states of all recurrent layers will be set to the initial values defined by the policy before computing the predictions.

  • save_states (bool) – if True, the internal states of all recurrent layers at the end of the calculation will be saved, and any previously saved states will be discarded. If False, the states at the end of the calculation will be discarded, and any previously saved states will be kept.

Return type:

the array of action probabilities, and the estimated value function

select_action(state: List[ndarray], deterministic: bool = False, use_saved_states: bool = True, save_states: bool = True) int[source]

Select an action to perform based on the environment’s state.

If the policy involves recurrent layers, this method can preserve their internal states between calls. Use the use_saved_states and save_states arguments to specify how it should behave.

Parameters:
  • state (array or list of arrays) – the state of the environment for which to select an action

  • deterministic (bool) – if True, always return the best action (that is, the one with highest probability). If False, randomly select an action based on the computed probabilities.

  • use_saved_states (bool) – if True, the states most recently saved by a previous call to predict() or select_action() will be used as the initial states. If False, the internal states of all recurrent layers will be set to the initial values defined by the policy before computing the predictions.

  • save_states (bool) – if True, the internal states of all recurrent layers at the end of the calculation will be saved, and any previously saved states will be discarded. If False, the states at the end of the calculation will be discarded, and any previously saved states will be kept.

Return type:

the index of the selected action

restore(strict: bool | None = True) None[source]

Reload the model parameters from the most recent checkpoint file.

save_checkpoint(max_checkpoints_to_keep: int = 5, model_dir: str | None = None) None[source]

Save a checkpoint to disk.

Usually you do not need to call this method, since fit() saves checkpoints automatically. If you have disabled automatic checkpointing during fitting, this can be called to manually write checkpoints.

Parameters:
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.

  • model_dir (str, default None) – Model directory to save checkpoint to. If None, revert to self.model_dir

get_checkpoints(model_dir: str | None = None) List[str][source]

Get a list of all available checkpoint files.

Parameters:

model_dir (str, default None) – Directory to get list of checkpoints from. Reverts to self.model_dir if None

class PPOLoss(value_weight: float, entropy_weight: float, clipping_width: float, action_prob_index: int, value_index: int)[source]

This class computes the loss function for PPO.

Example

>>> import deepchem as dc
>>> import numpy as np
>>> import torch
>>> import torch.nn.functional as F
>>> from deepchem.rl.torch_rl import PPOLoss
>>> outputs = [torch.tensor([[0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02,0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]]), torch.tensor([0.], requires_grad = True)]
>>> labels = np.array([[0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], dtype = np.float32)
>>> discount = np.array([-1.0203744, -0.02058018, 0.98931295, 2.009407, 1.019603, 0.01980097, -0.9901, 0.01, -1. , 0. ], dtype=np.float32)
>>> advantage = np.array([-1.0203744 ,-0.02058018, 0.98931295, 2.009407, 1.019603, 0.01980097, -0.9901 ,0.01 ,-1. , 0.], dtype = np.float32)
>>> old_prob = np.array([0.28183755, 0.95147914, 0.87922776, 0.8037652 , 0.11757819, 0.271103  , 0.21057394, 0.78721744, 0.6545527 , 0.8832647 ], dtype=np.float32)
>>> loss = PPOLoss(value_weight = 1.0, entropy_weight = 0.01, clipping_width = 0.2, action_prob_index = 0, value_index = 1)
>>> loss_val = loss(outputs, [labels], [discount, advantage, old_prob])
>>> loss_val
tensor(1.0761, grad_fn=<SubBackward0>)
__init__(value_weight: float, entropy_weight: float, clipping_width: float, action_prob_index: int, value_index: int)[source]