Model Classes

DeepChem maintains an extensive collection of models for scientific applications.

Model Cheatsheet

If you’re just getting started with DeepChem, the place to begin is this “model cheatsheet,” which lists the custom models DeepChem provides. Wrappers such as SklearnModel and XGBoostModel, which wrap external machine learning libraries, are excluded, but the table is otherwise complete.

To read the table: each row describes what’s needed to invoke a given model. Some models must be used with particular Transformer or Featurizer objects, and some have custom training methods. The table below tells you what’s needed to train each model.

| Model | Type | Input Type | Transformations | Acceptable Featurizers | Fit Method |
|-------|------|------------|-----------------|------------------------|------------|
| AtomicConvModel | Classifier/Regressor | Tuple | | ComplexNeighborListFragmentAtomicCoordinates | fit |
| ChemCeption | Classifier/Regressor | Tensor of shape (N, M, c) | | SmilesToImage | fit |
| CNN | Classifier/Regressor | Tensor of shape (N, c), (N, M, c), or (N, M, L, c) | | | fit |
| DTNNModel | Classifier/Regressor | Matrix of shape (N, N) | | CoulombMatrix | fit |
| DAGModel | Classifier/Regressor | ConvMol | DAGTransformer | ConvMolFeaturizer | fit |
| GraphConvModel | Classifier/Regressor | ConvMol | | ConvMolFeaturizer | fit |
| MPNNModel | Classifier/Regressor | WeaveMol | | WeaveFeaturizer | fit |
| MultitaskClassifier | Classifier | Vector of shape (N,) | | CircularFingerprint, RDKitDescriptors, CoulombMatrixEig, RdkitGridFeaturizer, BindingPocketFeaturizer, AdjacencyFingerprint, ElementPropertyFingerprint (below: “fingerprint featurizers”) | fit |
| MultitaskRegressor | Regressor | Vector of shape (N,) | | fingerprint featurizers | fit |
| MultitaskFitTransformRegressor | Regressor | Vector of shape (N,) | Any | fingerprint featurizers | fit |
| MultitaskIRVClassifier | Classifier | Vector of shape (N,) | IRVTransformer | fingerprint featurizers | fit |
| ProgressiveMultitaskClassifier | Classifier | Vector of shape (N,) | | fingerprint featurizers | fit |
| ProgressiveMultitaskRegressor | Regressor | Vector of shape (N,) | | fingerprint featurizers | fit |
| RobustMultitaskClassifier | Classifier | Vector of shape (N,) | | fingerprint featurizers | fit |
| RobustMultitaskRegressor | Regressor | Vector of shape (N,) | | fingerprint featurizers | fit |
| ScScoreModel | Classifier | Vector of shape (N,) | | fingerprint featurizers | fit |
| SeqToSeq | Sequence | Sequence | | | fit_sequences |
| Smiles2Vec | Classifier/Regressor | Sequence | | SmilesToSeq | fit |
| TextCNNModel | Classifier/Regressor | String | | | fit |
| WGAN | Adversarial | Pair | | | fit_gan |

Model

class deepchem.models.Model(model_instance=None, model_dir=None, **kwargs)[source]

Abstract base class for different ML models.

__init__(model_instance=None, model_dir=None, **kwargs)[source]

Abstract class for all models.

Parameters:
  • model_instance (object) – Wrapper around a scikit-learn/Keras/TensorFlow model object.
  • model_dir (str) – Path to directory where model will be stored.
evaluate(dataset, metrics, transformers=[], per_task_metrics=False)[source]

Evaluates the performance of this model on specified dataset.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object.
  • metrics (list of deepchem.metrics.Metric) – Evaluation metrics.
  • transformers (list) – List of deepchem.transformers.Transformer
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

fit(dataset, nb_epoch=10, batch_size=50, **kwargs)[source]

Fits a model on data in a Dataset object.

fit_on_batch(X, y, w)[source]

Updates existing model with new information.

static get_model_filename(model_dir)[source]

Given model directory, obtain filename for the model itself.

get_num_tasks()[source]

Get number of tasks.

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any
static get_params_filename(model_dir)[source]

Given model directory, obtain filename for the stored model parameters.

get_task_type()[source]

Currently models can only be classifiers or regressors.

predict(dataset, transformers=[], batch_size=None)[source]

Uses self to make predictions on provided Dataset object.

Returns:numpy ndarray of shape (n_samples,)
Return type:y_pred
predict_on_batch(X, **kwargs)[source]

Makes predictions on given batch of new data.

Parameters:X (np.ndarray) – Features
reload()[source]

Reload trained model from disk.

save()[source]

Dispatcher function for saving.

Each subclass is responsible for overriding this method.

set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:**params (dict) – Estimator parameters.
Returns:self – Estimator instance.
Return type:object

SklearnModel

class deepchem.models.SklearnModel(model_instance=None, model_dir=None, **kwargs)[source]

Wrapper class that exposes scikit-learn models as DeepChem Model objects.

__init__(model_instance=None, model_dir=None, **kwargs)[source]
Parameters:
  • model_instance (sklearn model) – Instance of model to wrap.
  • model_dir (str) – If specified, the model will be saved in this directory.
  • kwargs (dict) – kwargs[‘use_weights’] is a bool which determines if we pass weights into self.model_instance.fit()
evaluate(dataset, metrics, transformers=[], per_task_metrics=False)[source]

Evaluates the performance of this model on specified dataset.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object.
  • metrics (list of deepchem.metrics.Metric) – Evaluation metrics.
  • transformers (list) – List of deepchem.transformers.Transformer
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

fit(dataset, **kwargs)[source]

Fits SKLearn model to data.

fit_on_batch(X, y, w)[source]

Updates existing model with new information.

static get_model_filename(model_dir)[source]

Given model directory, obtain filename for the model itself.

get_num_tasks()[source]

Number of tasks for this model. Defaults to 1

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any
static get_params_filename(model_dir)[source]

Given model directory, obtain filename for the stored model parameters.

get_task_type()[source]

Currently models can only be classifiers or regressors.

predict(X, transformers=[])[source]

Makes predictions on dataset.

predict_on_batch(X, pad_batch=False)[source]

Makes predictions on batch of data.

Parameters:
  • X (np.ndarray) – Features
  • pad_batch (bool, optional) – Ignored for Sklearn Model. Only used for Tensorflow models with rigid batch-size requirements.
reload()[source]

Loads sklearn model from joblib file on disk.

save()[source]

Saves sklearn model to disk using joblib.

set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:**params (dict) – Estimator parameters.
Returns:self – Estimator instance.
Return type:object

XGBoostModel

class deepchem.models.XGBoostModel(model_instance=None, model_dir=None, **kwargs)[source]

Wrapper class for XGBoost models, via xgboost’s scikit-learn interface.

__init__(model_instance=None, model_dir=None, **kwargs)[source]

Wrapper class for XGBoost models.

Parameters:
  • model_instance (object) – Scikit-learn wrapper interface of xgboost
  • model_dir (str) – Path to directory where model will be stored.
evaluate(dataset, metrics, transformers=[], per_task_metrics=False)[source]

Evaluates the performance of this model on specified dataset.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object.
  • metrics (list of deepchem.metrics.Metric) – Evaluation metrics.
  • transformers (list) – List of deepchem.transformers.Transformer
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

fit(dataset, **kwargs)[source]

Fits XGBoost model to data.

fit_on_batch(X, y, w)[source]

Updates existing model with new information.

static get_model_filename(model_dir)[source]

Given model directory, obtain filename for the model itself.

get_num_tasks()[source]

Number of tasks for this model. Defaults to 1

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any
static get_params_filename(model_dir)[source]

Given model directory, obtain filename for the stored model parameters.

get_task_type()[source]

Currently models can only be classifiers or regressors.

predict(X, transformers=[])[source]

Makes predictions on dataset.

predict_on_batch(X, pad_batch=False)[source]

Makes predictions on batch of data.

Parameters:
  • X (np.ndarray) – Features
  • pad_batch (bool, optional) – Ignored for Sklearn Model. Only used for Tensorflow models with rigid batch-size requirements.
reload()[source]

Loads sklearn model from joblib file on disk.

save()[source]

Saves sklearn model to disk using joblib.

set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:**params (dict) – Estimator parameters.
Returns:self – Estimator instance.
Return type:object

KerasModel

DeepChem extensively uses Keras to build powerful machine learning models.

Training loss and validation metrics can be automatically logged to Weights & Biases with the following commands:

# Install wandb in shell
pip install wandb

# Log in from the shell (required only once)
wandb login

# Start a W&B run in your script (see the wandb docs for optional parameters)
import wandb
wandb.init(project="my project")

# Set the `wandb` argument when creating the KerasModel
model = KerasModel(..., wandb=True)
class deepchem.models.KerasModel(model, loss, output_types=None, batch_size=100, model_dir=None, learning_rate=0.001, optimizer=None, tensorboard=False, wandb=False, log_frequency=100, **kwargs)[source]

This is a DeepChem model implemented by a Keras model.

This class provides several advantages over using the Keras model’s fitting and prediction methods directly.

  1. It provides better integration with the rest of DeepChem, such as direct support for Datasets and Transformers.
  2. It defines the loss in a more flexible way. In particular, Keras does not support multidimensional weight matrices, which makes it impossible to implement most multitask models with Keras.
  3. It provides various additional features not found in the Keras Model class, such as uncertainty prediction and saliency mapping.

The loss function for a model can be defined in two different ways. For models that have only a single output and use a standard loss function, you can simply provide a dc.models.losses.Loss object. This defines the loss for each sample or sample/task pair. The result is automatically multiplied by the weights and averaged over the batch. Any additional losses computed by model layers, such as weight decay penalties, are also added.

For more complicated cases, you can instead provide a function that directly computes the total loss. It must be of the form f(outputs, labels, weights), taking the list of outputs from the model, the expected values, and any weight matrices. It should return a scalar equal to the value of the loss function for the batch. No additional processing is done to the result; it is up to you to do any weighting, averaging, adding of penalty terms, etc.

You can optionally provide an output_types argument, which describes how to interpret the model’s outputs. This should be a list of strings, one for each output. You can use an arbitrary output_type for an output, but some output_types are special and will undergo extra processing:

  • ‘prediction’: This is a normal output, and will be returned by predict(). If output types are not specified, all outputs are assumed to be of this type.
  • ‘loss’: This output will be used in place of the normal outputs for computing the loss function. For example, models that output probability distributions usually do so by computing unbounded numbers (the logits), then passing them through a softmax function to turn them into probabilities. When computing the cross entropy, it is more numerically stable to use the logits directly rather than the probabilities. You can do this by having the model produce both probabilities and logits as outputs, then specifying output_types=['prediction', 'loss']. When predict() is called, only the first output (the probabilities) will be returned. But during training, it is the second output (the logits) that will be passed to the loss function.
  • ‘variance’: This output is used for estimating the uncertainty in another output. To create a model that can estimate uncertainty, there must be the same number of ‘prediction’ and ‘variance’ outputs. Each variance output must have the same shape as the corresponding prediction output, and each element is an estimate of the variance in the corresponding prediction. Also be aware that if a model supports uncertainty, it MUST use dropout on every layer, and dropout must be enabled during uncertainty prediction. Otherwise, the uncertainties it computes will be inaccurate.
  • other: Arbitrary output_types can be used to extract outputs produced by the model, but will have no additional processing performed.
__init__(model, loss, output_types=None, batch_size=100, model_dir=None, learning_rate=0.001, optimizer=None, tensorboard=False, wandb=False, log_frequency=100, **kwargs)[source]

Create a new KerasModel.

Parameters:
  • model (tf.keras.Model) – the Keras model implementing the calculation
  • loss (dc.models.losses.Loss or function) – a Loss or function defining how to compute the training loss for each batch, as described above
  • output_types (list of strings) – the type of each output from the model, as described above
  • batch_size (int) – default batch size for training and evaluating
  • model_dir (str) – the directory on disk where the model will be stored. If this is None, a temporary directory is created.
  • learning_rate (float or LearningRateSchedule) – the learning rate to use for fitting. If optimizer is specified, this is ignored.
  • optimizer (Optimizer) – the optimizer to use for fitting. If this is specified, learning_rate is ignored.
  • tensorboard (bool) – whether to log progress to TensorBoard during training
  • wandb (bool) – whether to log progress to Weights & Biases during training
  • log_frequency (int) – The frequency at which to log data. Data is logged using logging by default. If tensorboard is set, data is also logged to TensorBoard. Logging happens at global steps. Roughly, a global step corresponds to one batch of training. If you’d like a printout every 10 batch steps, you’d set log_frequency=10 for example.
compute_saliency(X)[source]

Compute the saliency map for an input sample.

This computes the Jacobian matrix with the derivative of each output element with respect to each input element. More precisely,

  • If this model has a single output, it returns a matrix of shape (output_shape, input_shape) with the derivatives.
  • If this model has multiple outputs, it returns a list of matrices, one for each output.

This method cannot be used on models that take multiple inputs.

Parameters:X (ndarray) – the input data for a single sample
Returns:
Return type:the Jacobian matrix, or a list of matrices
default_generator(dataset, epochs=1, mode='fit', deterministic=True, pad_batches=True)[source]

Create a generator that iterates batches for a dataset.

Subclasses may override this method to customize how model inputs are generated from the data.

Parameters:
  • dataset (Dataset) – the data to iterate
  • epochs (int) – the number of times to iterate over the full dataset
  • mode (str) – allowed values are ‘fit’ (called during training), ‘predict’ (called during prediction), and ‘uncertainty’ (called during uncertainty prediction)
  • deterministic (bool) – whether to iterate over the dataset in order, or randomly shuffle the data for each epoch
  • pad_batches (bool) – whether to pad each batch up to this model’s preferred batch size
Returns:

a generator that iterates batches, each represented as a tuple of lists: ([inputs], [outputs], [weights])

evaluate(dataset, metrics, transformers=[], per_task_metrics=False)[source]

Evaluates the performance of this model on specified dataset.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object.
  • metrics (list of deepchem.metrics.Metric) – Evaluation metrics.
  • transformers (list) – List of deepchem.transformers.Transformer
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

evaluate_generator(generator, metrics, transformers=[], per_task_metrics=False)[source]

Evaluate the performance of this model on the data produced by a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • metrics (list of deepchem.metrics.Metric) – Evaluation metrics.
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

fit(dataset, nb_epoch=10, max_checkpoints_to_keep=5, checkpoint_interval=1000, deterministic=False, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on a dataset.

Parameters:
  • dataset (Dataset) – the Dataset to train on
  • nb_epoch (int) – the number of epochs to train for
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • deterministic (bool) – if True, the samples are processed in order. If False, a different random order is used for each epoch.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
fit_generator(generator, max_checkpoints_to_keep=5, checkpoint_interval=1000, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on data from a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
Returns:

Return type:

the average loss over the most recent checkpoint interval

fit_on_batch(X, y, w, variables=None, loss=None, callbacks=[], checkpoint=True, max_checkpoints_to_keep=5)[source]

Perform a single step of training.

Parameters:
  • X (ndarray) – the inputs for the batch
  • y (ndarray) – the labels for the batch
  • w (ndarray) – the weights for the batch
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
  • checkpoint (bool) – if true, save a checkpoint after performing the training step
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
get_checkpoints(model_dir=None)[source]

Get a list of all available checkpoint files.

Parameters:model_dir (str, default None) – Directory to get list of checkpoints from. Reverts to self.model_dir if None
get_global_step()[source]

Get the number of steps of fitting that have been performed.

static get_model_filename(model_dir)[source]

Given model directory, obtain filename for the model itself.

get_num_tasks()[source]

Get number of tasks.

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any
static get_params_filename(model_dir)[source]

Given model directory, obtain filename for the stored model parameters.

get_task_type()[source]

Currently models can only be classifiers or regressors.

load_from_pretrained(source_model, assignment_map=None, value_map=None, checkpoint=None, model_dir=None, include_top=True, inputs=None, **kwargs)[source]

Copies variable values from a pretrained model. source_model can either be a pretrained model or a model with the same architecture. value_map is a variable-value dictionary. If no value_map is provided, the variable values are restored to the source_model from a checkpoint and a default value_map is created. assignment_map is a dictionary mapping variables from the source_model to the current model. If no assignment_map is provided, one is made from scratch, assuming the model is composed of several layers with the final one being a dense layer. include_top controls whether the final dense layer is copied. The default assignment map is useful in cases where the type of task differs (classification vs. regression) and/or the number of tasks differs between the two settings.

Parameters:
  • source_model (dc.KerasModel, required) – source_model can either be the pretrained model or a dc.KerasModel with the same architecture as the pretrained model. It is used to restore from a checkpoint, if value_map is None and to create a default assignment map if assignment_map is None
  • assignment_map (Dict, default None) – Dictionary mapping the source_model variables and current model variables
  • value_map (Dict, default None) – Dictionary containing source_model trainable variables mapped to numpy arrays. If value_map is None, the values are restored and a default variable map is created using the restored values
  • checkpoint (str, default None) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints
  • model_dir (str, default None) – Restore model from custom model directory if needed
  • include_top (bool, default True) – if True, copies the weights and bias associated with the final dense layer. Used only when assignment map is None
  • inputs (List, input tensors for model) – if not None, then the weights are built for both the source and self. This option is useful only for models that are built by subclassing tf.keras.Model, and not using the functional API by tf.keras
predict(dataset, transformers=[], outputs=None, output_types=None)[source]

Uses self to make predictions on provided Dataset object.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
  • output_types (list of Strings) – The output types to return. Will retrieve all outputs of these types from the model.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_embedding(dataset)[source]

Predicts embeddings created by underlying model if any exist. An embedding must be specified to have output_type of ‘embedding’ in the model definition.

Parameters:dataset (dc.data.Dataset) – Dataset to make prediction on
Returns:

a NumPy array of the embeddings the model produces, or a list of arrays if it produces multiple embeddings
predict_on_batch(X, transformers=[], outputs=None)[source]

Generates predictions for input samples, processing samples in a batch.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_on_generator(generator, transformers=[], outputs=None, output_types=None)[source]

Makes predictions on batches produced by the given generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned. If outputs is specified, output_types must be None.
  • output_types (String or list of Strings) – If specified, all outputs of this type will be retrieved from the model. If output_types is specified, outputs must be None.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs
predict_uncertainty(dataset, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • masks (int) – the number of dropout masks to average over
Returns:

for each output, a tuple (y_pred, y_std), where y_pred is the predicted value of the output and each element of y_std estimates the standard deviation of the corresponding element of y_pred

predict_uncertainty_on_batch(X, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • masks (int) – the number of dropout masks to average over
Returns:

for each output, a tuple (y_pred, y_std), where y_pred is the predicted value of the output and each element of y_std estimates the standard deviation of the corresponding element of y_pred

reload()[source]

Reload trained model from disk.

restore(checkpoint=None, model_dir=None, session=None)[source]

Reload the values of all variables from a checkpoint file.

Parameters:
  • checkpoint (str) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints.
  • model_dir (str, default None) – Directory to restore checkpoint from. If None, use self.model_dir.
  • session (tf.Session(), default None) – Session to run restore ops under. If None, self.session is used.
save()[source]

Dispatcher function for saving.

Each subclass is responsible for overriding this method.

save_checkpoint(max_checkpoints_to_keep=5, model_dir=None)[source]

Save a checkpoint to disk.

Usually you do not need to call this method, since fit() saves checkpoints automatically. If you have disabled automatic checkpointing during fitting, this can be called to manually write checkpoints.

Parameters:
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • model_dir (str, default None) – Model directory to save checkpoint to. If None, revert to self.model_dir
set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:**params (dict) – Estimator parameters.
Returns:self – Estimator instance.
Return type:object
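The `<component>__<parameter>` routing described above can be sketched with a minimal, hypothetical estimator class (`ToyEstimator` is not part of DeepChem):

```python
class ToyEstimator:
    """Minimal sketch of scikit-learn-style set_params.

    Keys containing '__' are routed to nested components, e.g.
    'scaler__with_mean' sets the 'with_mean' parameter of self.scaler.
    """
    def set_params(self, **params):
        for key, value in params.items():
            if "__" in key:
                component, _, sub_key = key.partition("__")   # split at first '__'
                getattr(self, component).set_params(**{sub_key: value})
            else:
                setattr(self, key, value)
        return self  # estimator instance, matching the documented return type
```

For example, `outer.set_params(alpha=1.0, scaler__beta=2.0)` sets `alpha` on the outer estimator and `beta` on its nested `scaler` component.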

MultitaskRegressor

class deepchem.models.MultitaskRegressor(n_tasks, n_features, layer_sizes=[1000], weight_init_stddevs=0.02, bias_init_consts=1.0, weight_decay_penalty=0.0, weight_decay_penalty_type='l2', dropouts=0.5, activation_fns=<function relu>, uncertainty=False, residual=False, **kwargs)[source]

A fully connected network for multitask regression.

This class provides lots of options for customizing aspects of the model: the number and widths of layers, the activation functions, regularization methods, etc.

It optionally can compose the model from pre-activation residual blocks, as described in https://arxiv.org/abs/1603.05027, rather than a simple stack of dense layers. This often leads to easier training, especially when using a large number of layers. Note that residual blocks can only be used when successive layers have the same width. Wherever the layer width changes, a simple dense layer will be used even if residual=True.
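The width rule above can be sketched in NumPy (a simplified forward pass, not DeepChem's implementation): a pre-activation residual block applies the activation before the dense layer and adds the identity skip, but only when the layer preserves its width.

```python
import numpy as np

def forward(x, weights, residual=True):
    """Sketch of a dense stack with optional pre-activation residual blocks.

    weights: list of (in_dim, out_dim) matrices. The skip connection is
    added only when a layer preserves width; where the width changes a
    plain dense layer is used even if residual=True.
    """
    h = x
    for W in weights:
        pre = np.maximum(h, 0.0)                 # pre-activation (ReLU before the layer)
        out = pre @ W
        if residual and W.shape[0] == W.shape[1]:
            out = out + h                        # identity skip, widths match
        h = out
    return h
```

With a single identity weight matrix, the output is the activation plus the skip connection, which is what makes deep stacks of equal-width layers easier to train.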

__init__(n_tasks, n_features, layer_sizes=[1000], weight_init_stddevs=0.02, bias_init_consts=1.0, weight_decay_penalty=0.0, weight_decay_penalty_type='l2', dropouts=0.5, activation_fns=<function relu>, uncertainty=False, residual=False, **kwargs)[source]

Create a MultitaskRegressor.

In addition to the following arguments, this class also accepts all the keyword arguments from TensorGraph.

Parameters:
  • n_tasks (int) – number of tasks
  • n_features (int) – number of features
  • layer_sizes (list) – the size of each dense layer in the network. The length of this list determines the number of layers.
  • weight_init_stddevs (list or float) – the standard deviation of the distribution to use for weight initialization of each layer. The length of this list should equal len(layer_sizes)+1. The final element corresponds to the output layer. Alternatively this may be a single value instead of a list, in which case the same value is used for every layer.
  • bias_init_consts (list or float) – the value to initialize the biases in each layer to. The length of this list should equal len(layer_sizes)+1. The final element corresponds to the output layer. Alternatively this may be a single value instead of a list, in which case the same value is used for every layer.
  • weight_decay_penalty (float) – the magnitude of the weight decay penalty to use
  • weight_decay_penalty_type (str) – the type of penalty to use for weight decay, either ‘l1’ or ‘l2’
  • dropouts (list or float) – the dropout probability to use for each layer. The length of this list should equal len(layer_sizes). Alternatively this may be a single value instead of a list, in which case the same value is used for every layer.
  • activation_fns (list or object) – the TensorFlow activation function to apply to each layer. The length of this list should equal len(layer_sizes). Alternatively this may be a single value instead of a list, in which case the same value is used for every layer.
  • uncertainty (bool) – if True, include extra outputs and loss terms to enable the uncertainty in outputs to be predicted
  • residual (bool) – if True, the model will be composed of pre-activation residual blocks instead of a simple stack of dense layers.
compute_saliency(X)[source]

Compute the saliency map for an input sample.

This computes the Jacobian matrix containing the derivative of each output element with respect to each input element. More precisely,

  • If this model has a single output, it returns a matrix of shape (output_shape, input_shape) with the derivatives.
  • If this model has multiple outputs, it returns a list of matrices, one for each output.

This method cannot be used on models that take multiple inputs.

Parameters:X (ndarray) – the input data for a single sample
Returns:
Return type:the Jacobian matrix, or a list of matrices
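The (output_shape, input_shape) layout can be illustrated with a finite-difference sketch for a single-output model. DeepChem differentiates analytically through the network; this standalone function only mirrors the shape contract:

```python
import numpy as np

def saliency_fd(f, x, eps=1e-6):
    """Finite-difference Jacobian sketch: J[i, j] = d f(x)_i / d x_j.

    f maps a 1-D input vector to a 1-D output vector; returns a matrix of
    shape (output_dim, input_dim), mirroring compute_saliency's layout for
    a single-output model.
    """
    y0 = np.asarray(f(x), dtype=float)
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        xp = x.astype(float).copy()
        xp[j] += eps                                       # perturb one input element
        J[:, j] = (np.asarray(f(xp), dtype=float) - y0) / eps
    return J
```

Each column holds the sensitivity of every output to one input element, so row i, column j reads as "how output i changes per unit change in input j".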
default_generator(dataset, epochs=1, mode='fit', deterministic=True, pad_batches=True)[source]

Create a generator that iterates batches for a dataset.

Subclasses may override this method to customize how model inputs are generated from the data.

Parameters:
  • dataset (Dataset) – the data to iterate
  • epochs (int) – the number of times to iterate over the full dataset
  • mode (str) – allowed values are ‘fit’ (called during training), ‘predict’ (called during prediction), and ‘uncertainty’ (called during uncertainty prediction)
  • deterministic (bool) – whether to iterate over the dataset in order, or randomly shuffle the data for each epoch
  • pad_batches (bool) – whether to pad each batch up to this model’s preferred batch size
Returns:

  • a generator that iterates batches, each represented as a tuple of lists ([inputs], [outputs], [weights])
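The generator contract above can be sketched in plain Python (a simplified stand-in for the actual default_generator, which also handles Dataset objects and model-specific input construction):

```python
import numpy as np

def batch_generator(X, y, w, batch_size=4, epochs=1,
                    deterministic=True, pad_batches=True, seed=0):
    """Sketch of the default_generator contract.

    Yields ([inputs], [outputs], [weights]) tuples of lists; when
    pad_batches is True the final short batch is padded up to batch_size
    by repeating samples from the start of that batch.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        # iterate in order, or reshuffle for each epoch
        order = np.arange(n) if deterministic else rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            if pad_batches and len(idx) < batch_size:
                idx = np.resize(idx, batch_size)  # repeat indices to fill the batch
            yield [X[idx]], [y[idx]], [w[idx]]
```

Each yielded element is a list so that models with multiple inputs or outputs fit the same contract.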

evaluate(dataset, metrics, transformers=[], per_task_metrics=False)[source]

Evaluates the performance of this model on specified dataset.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object.
  • metrics (list of deepchem.metrics.Metric) – Evaluation metrics
  • transformers (list) – List of deepchem.transformers.Transformer
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

evaluate_generator(generator, metrics, transformers=[], per_task_metrics=False)[source]

Evaluate the performance of this model on the data produced by a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • metrics (list of deepchem.metrics.Metric) – Evaluation metrics
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

fit(dataset, nb_epoch=10, max_checkpoints_to_keep=5, checkpoint_interval=1000, deterministic=False, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on a dataset.

Parameters:
  • dataset (Dataset) – the Dataset to train on
  • nb_epoch (int) – the number of epochs to train for
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • deterministic (bool) – if True, the samples are processed in order. If False, a different random order is used for each epoch.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
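The control flow implied by these arguments can be sketched as a plain training loop. This is a hypothetical simplification: `train_step` and `save_checkpoint` stand in for the model's internals, and the real callbacks receive `(model, step)` rather than just the step.

```python
def training_loop(batches, train_step, save_checkpoint,
                  checkpoint_interval=1000, callbacks=[]):
    """Sketch of the control flow implied by fit()'s arguments.

    train_step(batch) -> loss; save_checkpoint() writes a checkpoint;
    checkpoint_interval=0 disables automatic checkpointing.
    """
    if callable(callbacks):
        callbacks = [callbacks]       # accept a single function or a list
    step = 0
    for batch in batches:
        loss = train_step(batch)
        step += 1
        for cb in callbacks:
            cb(step)                  # callbacks run after every step
        if checkpoint_interval > 0 and step % checkpoint_interval == 0:
            save_checkpoint()         # periodic automatic checkpointing
    return step
```

Passing `checkpoint_interval=0` and calling save_checkpoint() yourself reproduces the manual-checkpointing workflow described under save_checkpoint below.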
fit_generator(generator, max_checkpoints_to_keep=5, checkpoint_interval=1000, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on data from a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
Returns:

Return type:

the average loss over the most recent checkpoint interval

fit_on_batch(X, y, w, variables=None, loss=None, callbacks=[], checkpoint=True, max_checkpoints_to_keep=5)[source]

Perform a single step of training.

Parameters:
  • X (ndarray) – the inputs for the batch
  • y (ndarray) – the labels for the batch
  • w (ndarray) – the weights for the batch
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
  • checkpoint (bool) – if True, save a checkpoint after performing the training step
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
get_checkpoints(model_dir=None)[source]

Get a list of all available checkpoint files.

Parameters:model_dir (str, default None) – Directory to get list of checkpoints from. Reverts to self.model_dir if None
get_global_step()[source]

Get the number of steps of fitting that have been performed.

static get_model_filename(model_dir)[source]

Given model directory, obtain filename for the model itself.

get_num_tasks()[source]

Get number of tasks.

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any
static get_params_filename(model_dir)[source]

Given model directory, obtain filename for the model parameters.

get_task_type()[source]

Currently models can only be classifiers or regressors.

load_from_pretrained(source_model, assignment_map=None, value_map=None, checkpoint=None, model_dir=None, include_top=True, inputs=None, **kwargs)[source]

Copies variable values from a pretrained model. source_model can either be a pretrained model or a model with the same architecture. value_map is a variable-value dictionary. If no value_map is provided, the variable values are restored to the source_model from a checkpoint and a default value_map is created. assignment_map is a dictionary mapping variables from the source_model to the current model. If no assignment_map is provided, one is made from scratch and assumes the model is composed of several different layers, with the final one being a dense layer. include_top is used to control whether or not the final dense layer is used. The default assignment map is useful in cases where the type of task differs (classification vs regression) and/or the number of tasks differs between the source and the current model.

Parameters:
  • source_model (dc.KerasModel, required) – source_model can either be the pretrained model or a dc.KerasModel with the same architecture as the pretrained model. It is used to restore from a checkpoint, if value_map is None and to create a default assignment map if assignment_map is None
  • assignment_map (Dict, default None) – Dictionary mapping the source_model variables and current model variables
  • value_map (Dict, default None) – Dictionary containing source_model trainable variables mapped to numpy arrays. If value_map is None, the values are restored and a default variable map is created using the restored values
  • checkpoint (str, default None) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints
  • model_dir (str, default None) – Restore model from custom model directory if needed
  • include_top (bool, default True) – if True, copies the weights and bias associated with the final dense layer. Used only when assignment map is None
  • inputs (List, input tensors for model) – if not None, then the weights are built for both the source and self. This option is useful only for models that are built by subclassing tf.keras.Model, and not using the functional API by tf.keras
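The copy logic can be sketched with plain dicts (a hypothetical simplification: real variables are TensorFlow tensors, and the top layer is identified from the model structure rather than by a count):

```python
def copy_pretrained(source_vars, target_vars, assignment_map=None,
                    include_top=True, n_top_vars=2):
    """Sketch of load_from_pretrained's copy logic using ordered dicts.

    source_vars / target_vars: ordered {name: value} dicts. When no
    assignment_map is given, variables are paired in order; with
    include_top=False the last n_top_vars entries (assumed to be the
    final dense layer's kernel and bias) are skipped.
    """
    src_names = list(source_vars)
    tgt_names = list(target_vars)
    if assignment_map is None:
        if not include_top:
            src_names = src_names[:-n_top_vars]   # drop the final dense layer
            tgt_names = tgt_names[:-n_top_vars]
        assignment_map = dict(zip(src_names, tgt_names))
    for src, tgt in assignment_map.items():
        target_vars[tgt] = source_vars[src]       # copy variable value across
    return target_vars
```

Setting include_top=False leaves the target's output layer untouched, which is the typical choice when the task type or task count differs from the pretrained model.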
predict(dataset, transformers=[], outputs=None, output_types=None)[source]

Uses the model to make predictions on the provided Dataset object.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
  • output_types (list of Strings) – The output types to return. Will retrieve all outputs of these types from the model.
Returns:

  • a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_embedding(dataset)[source]

Predicts embeddings created by the underlying model, if any exist. An embedding must be specified with output_type ‘embedding’ in the model definition.

Parameters:dataset (dc.data.Dataset) – Dataset to make prediction on
Returns:
  • a NumPy array of the embeddings the model produces, or a list of arrays if it produces multiple embeddings
predict_on_batch(X, transformers=[], outputs=None)[source]

Generates predictions for input samples, processing samples in a batch.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
Returns:

  • a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_on_generator(generator, transformers=[], outputs=None, output_types=None)[source]

Makes predictions on batches of data produced by a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned. If outputs is specified, output_types must be None.
  • output_types (String or list of Strings) – If specified, all outputs of this type will be retrieved from the model. If output_types is specified, outputs must be None.
Returns:

  • a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs
predict_uncertainty(dataset, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • masks (int) – the number of dropout masks to average over
Returns:

  • for each output, a tuple (y_pred, y_std) where y_pred is the predicted value of the output, and each element of y_std estimates the standard deviation of the corresponding element of y_pred

predict_uncertainty_on_batch(X, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • masks (int) – the number of dropout masks to average over
Returns:

  • for each output, a tuple (y_pred, y_std) where y_pred is the predicted value of the output, and each element of y_std estimates the standard deviation of the corresponding element of y_pred

reload()[source]

Reload trained model from disk.

restore(checkpoint=None, model_dir=None, session=None)[source]

Reload the values of all variables from a checkpoint file.

Parameters:
  • checkpoint (str) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints.
  • model_dir (str, default None) – Directory to restore checkpoint from. If None, use self.model_dir.
  • session (tf.Session(), default None) – Session to run restore ops under. If None, self.session is used.
save()[source]

Dispatcher function for saving.

Each subclass is responsible for overriding this method.

save_checkpoint(max_checkpoints_to_keep=5, model_dir=None)[source]

Save a checkpoint to disk.

Usually you do not need to call this method, since fit() saves checkpoints automatically. If you have disabled automatic checkpointing during fitting, this can be called to manually write checkpoints.

Parameters:
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • model_dir (str, default None) – Model directory to save checkpoint to. If None, revert to self.model_dir
set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:**params (dict) – Estimator parameters.
Returns:self – Estimator instance.
Return type:object

MultitaskFitTransformRegressor

class deepchem.models.MultitaskFitTransformRegressor(n_tasks, n_features, fit_transformers=[], batch_size=50, **kwargs)[source]

Implements a MultitaskRegressor that performs on-the-fly transformation during fit/predict.

Example:

>>> import numpy as np
>>> import deepchem as dc
>>> n_samples = 10
>>> n_features = 3
>>> n_tasks = 1
>>> ids = np.arange(n_samples)
>>> X = np.random.rand(n_samples, n_features, n_features)
>>> y = np.zeros((n_samples, n_tasks))
>>> w = np.ones((n_samples, n_tasks))
>>> dataset = dc.data.NumpyDataset(X, y, w, ids)
>>> fit_transformers = [dc.trans.CoulombFitTransformer(dataset)]
>>> model = dc.models.MultitaskFitTransformRegressor(n_tasks, [n_features, n_features],
...     dropouts=[0.], learning_rate=0.003, weight_init_stddevs=[np.sqrt(6)/np.sqrt(1000)],
...     batch_size=n_samples, fit_transformers=fit_transformers)
>>> model.n_features
12
__init__(n_tasks, n_features, fit_transformers=[], batch_size=50, **kwargs)[source]

Create a MultitaskFitTransformRegressor.

In addition to the following arguments, this class also accepts all the keyword arguments from MultitaskRegressor.

Parameters:
  • n_tasks (int) – number of tasks
  • n_features (list or int) – number of features
  • fit_transformers (list) – List of dc.trans.FitTransformer objects
compute_saliency(X)[source]

Compute the saliency map for an input sample.

This computes the Jacobian matrix containing the derivative of each output element with respect to each input element. More precisely,

  • If this model has a single output, it returns a matrix of shape (output_shape, input_shape) with the derivatives.
  • If this model has multiple outputs, it returns a list of matrices, one for each output.

This method cannot be used on models that take multiple inputs.

Parameters:X (ndarray) – the input data for a single sample
Returns:
Return type:the Jacobian matrix, or a list of matrices
default_generator(dataset, epochs=1, mode='fit', deterministic=True, pad_batches=True)[source]

Create a generator that iterates batches for a dataset.

Subclasses may override this method to customize how model inputs are generated from the data.

Parameters:
  • dataset (Dataset) – the data to iterate
  • epochs (int) – the number of times to iterate over the full dataset
  • mode (str) – allowed values are ‘fit’ (called during training), ‘predict’ (called during prediction), and ‘uncertainty’ (called during uncertainty prediction)
  • deterministic (bool) – whether to iterate over the dataset in order, or randomly shuffle the data for each epoch
  • pad_batches (bool) – whether to pad each batch up to this model’s preferred batch size
Returns:

  • a generator that iterates batches, each represented as a tuple of lists ([inputs], [outputs], [weights])

evaluate(dataset, metrics, transformers=[], per_task_metrics=False)[source]

Evaluates the performance of this model on specified dataset.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object.
  • metrics (list of deepchem.metrics.Metric) – Evaluation metrics
  • transformers (list) – List of deepchem.transformers.Transformer
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

evaluate_generator(generator, metrics, transformers=[], per_task_metrics=False)[source]

Evaluate the performance of this model on the data produced by a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • metrics (list of deepchem.metrics.Metric) – Evaluation metrics
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

fit(dataset, nb_epoch=10, max_checkpoints_to_keep=5, checkpoint_interval=1000, deterministic=False, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on a dataset.

Parameters:
  • dataset (Dataset) – the Dataset to train on
  • nb_epoch (int) – the number of epochs to train for
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • deterministic (bool) – if True, the samples are processed in order. If False, a different random order is used for each epoch.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
fit_generator(generator, max_checkpoints_to_keep=5, checkpoint_interval=1000, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on data from a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
Returns:

Return type:

the average loss over the most recent checkpoint interval

fit_on_batch(X, y, w, variables=None, loss=None, callbacks=[], checkpoint=True, max_checkpoints_to_keep=5)[source]

Perform a single step of training.

Parameters:
  • X (ndarray) – the inputs for the batch
  • y (ndarray) – the labels for the batch
  • w (ndarray) – the weights for the batch
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
  • checkpoint (bool) – if True, save a checkpoint after performing the training step
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
get_checkpoints(model_dir=None)[source]

Get a list of all available checkpoint files.

Parameters:model_dir (str, default None) – Directory to get list of checkpoints from. Reverts to self.model_dir if None
get_global_step()[source]

Get the number of steps of fitting that have been performed.

static get_model_filename(model_dir)[source]

Given model directory, obtain filename for the model itself.

get_num_tasks()[source]

Get number of tasks.

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any
static get_params_filename(model_dir)[source]

Given model directory, obtain filename for the model parameters.

get_task_type()[source]

Currently models can only be classifiers or regressors.

load_from_pretrained(source_model, assignment_map=None, value_map=None, checkpoint=None, model_dir=None, include_top=True, inputs=None, **kwargs)[source]

Copies variable values from a pretrained model. source_model can either be a pretrained model or a model with the same architecture. value_map is a variable-value dictionary. If no value_map is provided, the variable values are restored to the source_model from a checkpoint and a default value_map is created. assignment_map is a dictionary mapping variables from the source_model to the current model. If no assignment_map is provided, one is made from scratch and assumes the model is composed of several different layers, with the final one being a dense layer. include_top is used to control whether or not the final dense layer is used. The default assignment map is useful in cases where the type of task differs (classification vs regression) and/or the number of tasks differs between the source and the current model.

Parameters:
  • source_model (dc.KerasModel, required) – source_model can either be the pretrained model or a dc.KerasModel with the same architecture as the pretrained model. It is used to restore from a checkpoint, if value_map is None and to create a default assignment map if assignment_map is None
  • assignment_map (Dict, default None) – Dictionary mapping the source_model variables and current model variables
  • value_map (Dict, default None) – Dictionary containing source_model trainable variables mapped to numpy arrays. If value_map is None, the values are restored and a default variable map is created using the restored values
  • checkpoint (str, default None) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints
  • model_dir (str, default None) – Restore model from custom model directory if needed
  • include_top (bool, default True) – if True, copies the weights and bias associated with the final dense layer. Used only when assignment map is None
  • inputs (List, input tensors for model) – if not None, then the weights are built for both the source and self. This option is useful only for models that are built by subclassing tf.keras.Model, and not using the functional API by tf.keras
predict(dataset, transformers=[], outputs=None, output_types=None)[source]

Uses the model to make predictions on the provided Dataset object.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
  • output_types (list of Strings) – The output types to return. Will retrieve all outputs of these types from the model.
Returns:

  • a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_embedding(dataset)[source]

Predicts embeddings created by the underlying model, if any exist. An embedding must be specified with output_type ‘embedding’ in the model definition.

Parameters:dataset (dc.data.Dataset) – Dataset to make prediction on
Returns:
  • a NumPy array of the embeddings the model produces, or a list of arrays if it produces multiple embeddings
predict_on_batch(X, transformers=[], outputs=None)[source]

Generates predictions for input samples, processing samples in a batch.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
Returns:

  • a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_on_generator(generator, transformers=[], outputs=None, output_types=None)[source]
Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned. If outputs is specified, output_types must be None.
  • output_types (String or list of Strings) – If specified, all outputs of this type will be retrieved from the model. If output_types is specified, outputs must be None.
Returns:
  a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs
predict_uncertainty(dataset, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • masks (int) – the number of dropout masks to average over
Returns:

  • for each output, a tuple (y_pred, y_std), where y_pred is the predicted value of the output and each element of y_std estimates the standard deviation of the corresponding element of y_pred
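
The dropout-mask averaging can be sketched in plain numpy: repeat a stochastic forward pass, then take the mean and standard deviation across passes. predict_with_dropout and predict_uncertainty below are hypothetical illustrations of the idea, not DeepChem internals:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_with_dropout(x, w, keep_prob=0.5):
    """One stochastic forward pass: a fresh random dropout mask on the features."""
    mask = rng.random(x.shape) < keep_prob
    return (x * mask / keep_prob) @ w

def predict_uncertainty(x, w, masks=50):
    """Average `masks` stochastic passes; the spread across passes
    approximates the epistemic part of the uncertainty."""
    preds = np.stack([predict_with_dropout(x, w) for _ in range(masks)])
    return preds.mean(axis=0), preds.std(axis=0)

x = rng.random((4, 8))
w = rng.random((8, 1))
y_pred, y_std = predict_uncertainty(x, w, masks=50)
```

In DeepChem the aleatoric term is additionally read from the model's own variance outputs, which this sketch omits.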

predict_uncertainty_on_batch(X, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • masks (int) – the number of dropout masks to average over
Returns:

  • for each output, a tuple (y_pred, y_std), where y_pred is the predicted value of the output and each element of y_std estimates the standard deviation of the corresponding element of y_pred

reload()[source]

Reload trained model from disk.

restore(checkpoint=None, model_dir=None, session=None)[source]

Reload the values of all variables from a checkpoint file.

Parameters:
  • checkpoint (str) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints.
  • model_dir (str, default None) – Directory to restore checkpoint from. If None, use self.model_dir.
  • session (tf.Session(), default None) – Session to run restore ops under. If None, self.session is used.
save()[source]

Dispatcher function for saving.

Each subclass is responsible for overriding this method.

save_checkpoint(max_checkpoints_to_keep=5, model_dir=None)[source]

Save a checkpoint to disk.

Usually you do not need to call this method, since fit() saves checkpoints automatically. If you have disabled automatic checkpointing during fitting, this can be called to manually write checkpoints.

Parameters:
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • model_dir (str, default None) – Model directory to save checkpoint to. If None, revert to self.model_dir
set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:**params (dict) – Estimator parameters.
Returns:self – Estimator instance.
Return type:object
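
How the <component>__<parameter> keys are routed can be sketched as follows; set_nested_params is a hypothetical helper mirroring the scikit-learn convention, not DeepChem code:

```python
def set_nested_params(estimator_params, **params):
    """Route 'component__parameter' keys into nested dicts, mirroring
    the scikit-learn convention that set_params() follows."""
    for key, value in params.items():
        if "__" in key:
            component, sub_key = key.split("__", 1)
            estimator_params[component][sub_key] = value
        else:
            estimator_params[key] = value
    return estimator_params

params = {"dropout": 0.5, "optimizer": {"lr": 0.001}}
set_nested_params(params, dropout=0.2, optimizer__lr=0.01)
```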

MultitaskClassifier

class deepchem.models.MultitaskClassifier(n_tasks, n_features, layer_sizes=[1000], weight_init_stddevs=0.02, bias_init_consts=1.0, weight_decay_penalty=0.0, weight_decay_penalty_type='l2', dropouts=0.5, activation_fns=<function relu>, n_classes=2, residual=False, **kwargs)[source]

A fully connected network for multitask classification.

This class provides lots of options for customizing aspects of the model: the number and widths of layers, the activation functions, regularization methods, etc.

It optionally can compose the model from pre-activation residual blocks, as described in https://arxiv.org/abs/1603.05027, rather than a simple stack of dense layers. This often leads to easier training, especially when using a large number of layers. Note that residual blocks can only be used when successive layers have the same width. Wherever the layer width changes, a simple dense layer will be used even if residual=True.
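
The width rule above can be sketched in plain numpy: a pre-activation block applies the activation before the dense layer, and the skip connection is used only when input and output widths match. This is an illustrative sketch, not the model's actual implementation:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(x, weights, residual=True):
    """Pre-activation residual stack: each block computes relu-then-dense,
    and adds a skip connection only when the layer width is unchanged."""
    for w in weights:
        h = relu(x) @ w
        if residual and w.shape[0] == w.shape[1]:
            x = x + h   # residual block: same width in and out
        else:
            x = h       # width changes: plain dense layer
    return x

rng = np.random.default_rng(0)
x = rng.random((2, 16))
# Two same-width layers get skip connections; the final layer narrows to 4.
weights = [rng.random((16, 16)), rng.random((16, 16)), rng.random((16, 4))]
y = forward(x, weights)
```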

__init__(n_tasks, n_features, layer_sizes=[1000], weight_init_stddevs=0.02, bias_init_consts=1.0, weight_decay_penalty=0.0, weight_decay_penalty_type='l2', dropouts=0.5, activation_fns=<function relu>, n_classes=2, residual=False, **kwargs)[source]

Create a MultitaskClassifier.

In addition to the following arguments, this class also accepts all the keyword arguments from TensorGraph.

Parameters:
  • n_tasks (int) – number of tasks
  • n_features (int) – number of features
  • layer_sizes (list) – the size of each dense layer in the network. The length of this list determines the number of layers.
  • weight_init_stddevs (list or float) – the standard deviation of the distribution to use for weight initialization of each layer. The length of this list should equal len(layer_sizes). Alternatively this may be a single value instead of a list, in which case the same value is used for every layer.
  • bias_init_consts (list or float) – the value to initialize the biases in each layer to. The length of this list should equal len(layer_sizes). Alternatively this may be a single value instead of a list, in which case the same value is used for every layer.
  • weight_decay_penalty (float) – the magnitude of the weight decay penalty to use
  • weight_decay_penalty_type (str) – the type of penalty to use for weight decay, either ‘l1’ or ‘l2’
  • dropouts (list or float) – the dropout probability to use for each layer. The length of this list should equal len(layer_sizes). Alternatively this may be a single value instead of a list, in which case the same value is used for every layer.
  • activation_fns (list or object) – the Tensorflow activation function to apply to each layer. The length of this list should equal len(layer_sizes). Alternatively this may be a single value instead of a list, in which case the same value is used for every layer.
  • n_classes (int) – the number of classes
  • residual (bool) – if True, the model will be composed of pre-activation residual blocks instead of a simple stack of dense layers.
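
The “single value or list” convention shared by weight_init_stddevs, bias_init_consts, dropouts, and activation_fns can be sketched with a hypothetical helper:

```python
def per_layer(value, n_layers):
    """Expand a scalar hyperparameter to one value per layer, or check
    that a list already has one entry per layer. Illustrative only."""
    if isinstance(value, (list, tuple)):
        assert len(value) == n_layers, "need one value per layer"
        return list(value)
    return [value] * n_layers

layer_sizes = [1000, 1000, 500]
dropouts = per_layer(0.5, len(layer_sizes))                  # scalar broadcast
stddevs = per_layer([0.02, 0.02, 0.05], len(layer_sizes))    # explicit list
```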
compute_saliency(X)[source]

Compute the saliency map for an input sample.

This computes the Jacobian matrix with the derivative of each output element with respect to each input element. More precisely,

  • If this model has a single output, it returns a matrix of shape (output_shape, input_shape) with the derivatives.
  • If this model has multiple outputs, it returns a list of matrices, one for each output.

This method cannot be used on models that take multiple inputs.

Parameters:X (ndarray) – the input data for a single sample
Returns:
Return type:the Jacobian matrix, or a list of matrices
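
The shape of the result can be checked against a finite-difference approximation; numerical_saliency below is a hypothetical numpy sketch, whereas compute_saliency obtains the Jacobian analytically:

```python
import numpy as np

def numerical_saliency(f, x, eps=1e-6):
    """Finite-difference Jacobian d f(x)_i / d x_j, with shape
    (output_shape, input_shape), matching compute_saliency's layout."""
    y0 = f(x)
    jac = np.zeros(y0.shape + x.shape)
    for j in np.ndindex(x.shape):
        xp = x.copy()
        xp[j] += eps
        jac[(Ellipsis,) + j] = (f(xp) - y0) / eps  # one input column at a time
    return jac

# Toy model with two outputs over two inputs.
f = lambda x: np.array([x[0] * 2.0, x[0] + x[1]])
jac = numerical_saliency(f, np.array([1.0, 3.0]))
```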
default_generator(dataset, epochs=1, mode='fit', deterministic=True, pad_batches=True)[source]

Create a generator that iterates batches for a dataset.

Subclasses may override this method to customize how model inputs are generated from the data.

Parameters:
  • dataset (Dataset) – the data to iterate
  • epochs (int) – the number of times to iterate over the full dataset
  • mode (str) – allowed values are ‘fit’ (called during training), ‘predict’ (called during prediction), and ‘uncertainty’ (called during uncertainty prediction)
  • deterministic (bool) – whether to iterate over the dataset in order, or randomly shuffle the data for each epoch
  • pad_batches (bool) – whether to pad each batch up to this model’s preferred batch size
Returns:

  • a generator that iterates batches, each represented as a tuple of lists ([inputs], [outputs], [weights])
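
The batch structure and the pad_batches behavior can be sketched in plain numpy; this is an illustrative generator, not DeepChem's implementation:

```python
import numpy as np

def default_generator(X, y, w, batch_size=4, epochs=1,
                      deterministic=True, pad_batches=True):
    """Yield ([inputs], [outputs], [weights]) tuples, padding the final
    short batch up to batch_size by wrapping around the dataset."""
    n = len(X)
    for _ in range(epochs):
        order = np.arange(n) if deterministic else np.random.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            if pad_batches and len(idx) < batch_size:
                extra = order[:batch_size - len(idx)]
                idx = np.concatenate([idx, extra])
            yield [X[idx]], [y[idx]], [w[idx]]

X = np.arange(10).reshape(10, 1).astype(float)
y = np.arange(10).astype(float)
w = np.ones(10)
batches = list(default_generator(X, y, w, batch_size=4))
```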

evaluate(dataset, metrics, transformers=[], per_task_metrics=False)[source]

Evaluates the performance of this model on the specified dataset.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object.
  • metrics (deepchem.metrics.Metric) – Evaluation metric
  • transformers (list) – List of deepchem.transformers.Transformer
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

evaluate_generator(generator, metrics, transformers=[], per_task_metrics=False)[source]

Evaluate the performance of this model on the data produced by a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • metrics (deepchem.metrics.Metric) – Evaluation metric
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

fit(dataset, nb_epoch=10, max_checkpoints_to_keep=5, checkpoint_interval=1000, deterministic=False, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on a dataset.

Parameters:
  • dataset (Dataset) – the Dataset to train on
  • nb_epoch (int) – the number of epochs to train for
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • deterministic (bool) – if True, the samples are processed in order. If False, a different random order is used for each epoch.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
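
The callback contract f(model, step) can be sketched with a toy training loop; fit_with_callbacks and the dict standing in for the model are hypothetical:

```python
def fit_with_callbacks(train_step, n_steps, callbacks=[]):
    """Invoke each callback as f(model, step) after every training step,
    matching the callbacks argument of fit()/fit_generator(). `model`
    here is a plain dict standing in for the model object."""
    if callable(callbacks):
        callbacks = [callbacks]   # a single function is also accepted
    model = {"step": 0}
    for step in range(1, n_steps + 1):
        train_step(model)
        model["step"] = step
        for cb in callbacks:
            cb(model, step)       # e.g. validation or logging
    return model

seen = []
log = lambda model, step: seen.append(step)
fit_with_callbacks(lambda m: None, n_steps=3, callbacks=log)
```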
fit_generator(generator, max_checkpoints_to_keep=5, checkpoint_interval=1000, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on data from a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
Returns:

the average loss over the most recent checkpoint interval

fit_on_batch(X, y, w, variables=None, loss=None, callbacks=[], checkpoint=True, max_checkpoints_to_keep=5)[source]

Perform a single step of training.

Parameters:
  • X (ndarray) – the inputs for the batch
  • y (ndarray) – the labels for the batch
  • w (ndarray) – the weights for the batch
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
  • checkpoint (bool) – if True, save a checkpoint after performing the training step
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
get_checkpoints(model_dir=None)[source]

Get a list of all available checkpoint files.

Parameters:model_dir (str, default None) – Directory to get list of checkpoints from. Reverts to self.model_dir if None
get_global_step()[source]

Get the number of steps of fitting that have been performed.

static get_model_filename(model_dir)[source]

Given model directory, obtain filename for the model itself.

get_num_tasks()[source]

Get number of tasks.

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any
static get_params_filename(model_dir)[source]

Given model directory, obtain filename for the stored model parameters.

get_task_type()[source]

Currently models can only be classifiers or regressors.

load_from_pretrained(source_model, assignment_map=None, value_map=None, checkpoint=None, model_dir=None, include_top=True, inputs=None, **kwargs)[source]

Copies variable values from a pretrained model. source_model can either be a pretrained model or a model with the same architecture. value_map is a variable-value dictionary. If no value_map is provided, the variable values are restored to the source_model from a checkpoint and a default value_map is created. assignment_map is a dictionary mapping variables from the source_model to the current model. If no assignment_map is provided, one is made from scratch, assuming the model is composed of several layers with a final dense layer. include_top controls whether the final dense layer is copied. The default assignment map is useful when the type of task (classification vs. regression) and/or the number of tasks differs between the source and current models.

Parameters:
  • source_model (dc.KerasModel, required) – source_model can either be the pretrained model or a dc.KerasModel with the same architecture as the pretrained model. It is used to restore from a checkpoint, if value_map is None and to create a default assignment map if assignment_map is None
  • assignment_map (Dict, default None) – Dictionary mapping the source_model variables and current model variables
  • value_map (Dict, default None) – Dictionary containing source_model trainable variables mapped to numpy arrays. If value_map is None, the values are restored and a default variable map is created using the restored values
  • checkpoint (str, default None) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints
  • model_dir (str, default None) – Restore model from custom model directory if needed
  • include_top (bool, default True) – if True, copies the weights and bias associated with the final dense layer. Used only when assignment map is None
  • inputs (List, input tensors for model) – if not None, then the weights are built for both the source and self. This option is useful only for models that are built by subclassing tf.keras.Model, and not using the functional API of tf.keras
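
The assignment-map logic can be sketched with plain dictionaries of numpy arrays; copy_pretrained and the variable names are hypothetical illustrations, not DeepChem internals:

```python
import numpy as np

def copy_pretrained(source_values, assignment_map=None, include_top=True):
    """Copy variable values from a source model (here a dict of numpy
    arrays) into a target. With no assignment_map, a default identity
    map is built over all layers, optionally dropping the final dense
    layer when include_top is False."""
    names = list(source_values)
    if assignment_map is None:
        # Assumption for this sketch: the last entry is the final dense layer.
        kept = names if include_top else names[:-1]
        assignment_map = {n: n for n in kept}
    return {dst: source_values[src].copy() for src, dst in assignment_map.items()}

source = {"dense_1/w": np.ones((4, 4)), "dense_2/w": np.zeros((4, 2))}
target = copy_pretrained(source, include_top=False)   # final layer not copied
```

Dropping the top layer is the typical choice when the pretrained and target models differ in task type or number of tasks.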
predict(dataset, transformers=[], outputs=None, output_types=None)[source]

Uses self to make predictions on provided Dataset object.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
  • output_types (list of Strings) – The output types to return. Will retrieve all outputs of these types from the model.
Returns:

  • a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_embedding(dataset)[source]

Predicts embeddings created by the underlying model, if any exist. An embedding must be declared with output_type ‘embedding’ in the model definition.

Parameters:dataset (dc.data.Dataset) – Dataset to make prediction on
Returns:
  • a NumPy array of the embeddings the model produces, or a list of arrays if it produces multiple embeddings
predict_on_batch(X, transformers=[], outputs=None)[source]

Generates predictions for input samples, processing samples in a batch.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
Returns:

  • a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_on_generator(generator, transformers=[], outputs=None, output_types=None)[source]
Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned. If outputs is specified, output_types must be None.
  • output_types (String or list of Strings) – If specified, all outputs of this type will be retrieved from the model. If output_types is specified, outputs must be None.
Returns:
  a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs
predict_uncertainty(dataset, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • masks (int) – the number of dropout masks to average over
Returns:

  • for each output, a tuple (y_pred, y_std), where y_pred is the predicted value of the output and each element of y_std estimates the standard deviation of the corresponding element of y_pred

predict_uncertainty_on_batch(X, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • masks (int) – the number of dropout masks to average over
Returns:

  • for each output, a tuple (y_pred, y_std), where y_pred is the predicted value of the output and each element of y_std estimates the standard deviation of the corresponding element of y_pred

reload()[source]

Reload trained model from disk.

restore(checkpoint=None, model_dir=None, session=None)[source]

Reload the values of all variables from a checkpoint file.

Parameters:
  • checkpoint (str) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints.
  • model_dir (str, default None) – Directory to restore checkpoint from. If None, use self.model_dir.
  • session (tf.Session(), default None) – Session to run restore ops under. If None, self.session is used.
save()[source]

Dispatcher function for saving.

Each subclass is responsible for overriding this method.

save_checkpoint(max_checkpoints_to_keep=5, model_dir=None)[source]

Save a checkpoint to disk.

Usually you do not need to call this method, since fit() saves checkpoints automatically. If you have disabled automatic checkpointing during fitting, this can be called to manually write checkpoints.

Parameters:
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • model_dir (str, default None) – Model directory to save checkpoint to. If None, revert to self.model_dir
set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:**params (dict) – Estimator parameters.
Returns:self – Estimator instance.
Return type:object

TensorflowMultitaskIRVClassifier

class deepchem.models.TensorflowMultitaskIRVClassifier(*args, **kwargs)[source]
__init__(*args, **kwargs)[source]

Initialize MultitaskIRVClassifier

Parameters:
  • n_tasks (int) – Number of tasks
  • K (int) – Number of nearest neighbours used in classification
  • penalty (float) – Amount of weight-decay penalty (L1 or L2) applied
compute_saliency(X)[source]

Compute the saliency map for an input sample.

This computes the Jacobian matrix with the derivative of each output element with respect to each input element. More precisely,

  • If this model has a single output, it returns a matrix of shape (output_shape, input_shape) with the derivatives.
  • If this model has multiple outputs, it returns a list of matrices, one for each output.

This method cannot be used on models that take multiple inputs.

Parameters:X (ndarray) – the input data for a single sample
Returns:
Return type:the Jacobian matrix, or a list of matrices
default_generator(dataset, epochs=1, mode='fit', deterministic=True, pad_batches=True)[source]

Create a generator that iterates batches for a dataset.

Subclasses may override this method to customize how model inputs are generated from the data.

Parameters:
  • dataset (Dataset) – the data to iterate
  • epochs (int) – the number of times to iterate over the full dataset
  • mode (str) – allowed values are ‘fit’ (called during training), ‘predict’ (called during prediction), and ‘uncertainty’ (called during uncertainty prediction)
  • deterministic (bool) – whether to iterate over the dataset in order, or randomly shuffle the data for each epoch
  • pad_batches (bool) – whether to pad each batch up to this model’s preferred batch size
Returns:

  • a generator that iterates batches, each represented as a tuple of lists ([inputs], [outputs], [weights])

evaluate(dataset, metrics, transformers=[], per_task_metrics=False)[source]

Evaluates the performance of this model on the specified dataset.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object.
  • metrics (deepchem.metrics.Metric) – Evaluation metric
  • transformers (list) – List of deepchem.transformers.Transformer
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

evaluate_generator(generator, metrics, transformers=[], per_task_metrics=False)[source]

Evaluate the performance of this model on the data produced by a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • metrics (deepchem.metrics.Metric) – Evaluation metric
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

fit(dataset, nb_epoch=10, max_checkpoints_to_keep=5, checkpoint_interval=1000, deterministic=False, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on a dataset.

Parameters:
  • dataset (Dataset) – the Dataset to train on
  • nb_epoch (int) – the number of epochs to train for
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • deterministic (bool) – if True, the samples are processed in order. If False, a different random order is used for each epoch.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
fit_generator(generator, max_checkpoints_to_keep=5, checkpoint_interval=1000, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on data from a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
Returns:

the average loss over the most recent checkpoint interval

fit_on_batch(X, y, w, variables=None, loss=None, callbacks=[], checkpoint=True, max_checkpoints_to_keep=5)[source]

Perform a single step of training.

Parameters:
  • X (ndarray) – the inputs for the batch
  • y (ndarray) – the labels for the batch
  • w (ndarray) – the weights for the batch
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
  • checkpoint (bool) – if True, save a checkpoint after performing the training step
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
get_checkpoints(model_dir=None)[source]

Get a list of all available checkpoint files.

Parameters:model_dir (str, default None) – Directory to get list of checkpoints from. Reverts to self.model_dir if None
get_global_step()[source]

Get the number of steps of fitting that have been performed.

static get_model_filename(model_dir)[source]

Given model directory, obtain filename for the model itself.

get_num_tasks()[source]

Get number of tasks.

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any
static get_params_filename(model_dir)[source]

Given model directory, obtain filename for the stored model parameters.

get_task_type()[source]

Currently models can only be classifiers or regressors.

load_from_pretrained(source_model, assignment_map=None, value_map=None, checkpoint=None, model_dir=None, include_top=True, inputs=None, **kwargs)[source]

Copies variable values from a pretrained model. source_model can either be a pretrained model or a model with the same architecture. value_map is a variable-value dictionary. If no value_map is provided, the variable values are restored to the source_model from a checkpoint and a default value_map is created. assignment_map is a dictionary mapping variables from the source_model to the current model. If no assignment_map is provided, one is made from scratch and assumes the model is composed of several different layers, with the final one being a dense layer. include_top is used to control whether or not the final dense layer is used. The default assignment map is useful in cases where the type of task is different (classification vs regression) and/or the number of tasks differs between the pretrained model and the current one.

Parameters:
  • source_model (dc.KerasModel, required) – source_model can either be the pretrained model or a dc.KerasModel with the same architecture as the pretrained model. It is used to restore from a checkpoint, if value_map is None and to create a default assignment map if assignment_map is None
  • assignment_map (Dict, default None) – Dictionary mapping the source_model variables and current model variables
  • value_map (Dict, default None) – Dictionary containing source_model trainable variables mapped to numpy arrays. If value_map is None, the values are restored and a default variable map is created using the restored values
  • checkpoint (str, default None) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints
  • model_dir (str, default None) – Restore model from custom model directory if needed
  • include_top (bool, default True) – if True, copies the weights and bias associated with the final dense layer. Used only when assignment map is None
  • inputs (List, input tensors for model) – if not None, then the weights are built for both the source and self. This option is useful only for models that are built by subclassing tf.keras.Model, and not using the functional API by tf.keras
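As a toy sketch of how value_map and assignment_map interact (names and dictionary shapes are hypothetical; this is not the KerasModel internals): value_map holds the pretrained values, assignment_map says which source variable feeds which target variable, and include_top decides whether the final dense layer is copied.

```python
# Hypothetical variable names; real variables are tf.Variable objects.
value_map = {"dense_1/kernel": [[1.0]], "dense_1/bias": [0.1],
             "head/kernel": [[9.0]]}
assignment_map = {"dense_1/kernel": "layer_1/kernel",
                  "dense_1/bias": "layer_1/bias",
                  "head/kernel": "output/kernel"}

def copy_weights(value_map, assignment_map, include_top=True):
    """Copy pretrained values into target slots named by assignment_map."""
    target = {}
    for src, dst in assignment_map.items():
        if not include_top and src.startswith("head/"):
            continue  # skip the final dense layer when include_top is False
        target[dst] = value_map[src]
    return target

restored = copy_weights(value_map, assignment_map, include_top=False)
# restored contains the layer_1/* entries but not output/kernel
```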
predict(dataset, transformers=[], outputs=None, output_types=None)[source]

Uses self to make predictions on provided Dataset object.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
  • output_types (list of Strings) – The output types to return. Will retrieve all outputs of these types from the model.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_embedding(dataset)[source]

Predicts embeddings created by underlying model if any exist. An embedding must be specified to have output_type of ‘embedding’ in the model definition.

Parameters:dataset (dc.data.Dataset) – Dataset to make prediction on
Returns:
a NumPy array of the embeddings the model produces, or a list of arrays if it produces multiple embeddings
predict_on_batch(X, transformers=[], outputs=None)[source]

Generates predictions for input samples, processing samples in a batch.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_on_generator(generator, transformers=[], outputs=None, output_types=None)[source]
Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned. If outputs is specified, output_types must be None.
  • output_types (String or list of Strings) – If specified, all outputs of this type will be retrieved from the model. If output_types is specified, outputs must be None.
Returns: a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs
predict_uncertainty(dataset, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • masks (int) – the number of dropout masks to average over
Returns:

for each output, a tuple (y_pred, y_std) where y_pred is the predicted value of the output, and each element of y_std estimates the standard deviation of the corresponding element of y_pred
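The aggregation described above can be sketched in plain numpy (an assumed reading of the Kendall & Gal 2017 recipe, not DeepChem's internal code): run the model `masks` times with different dropout masks, average the predictions, and combine the epistemic variance (spread of the per-mask predictions) with the aleatoric variance (the model's own predicted variance).

```python
import numpy as np

# Illustrative stand-in for per-mask model outputs on 4 samples.
rng = np.random.default_rng(0)
masks = 50
preds = rng.normal(loc=2.0, scale=0.1, size=(masks, 4))  # per-mask predictions
pred_var = np.full((masks, 4), 0.04)                     # per-mask predicted variance

y_pred = preds.mean(axis=0)                              # average prediction
# total std: epistemic (variance of the means) + aleatoric (mean predicted variance)
y_std = np.sqrt(preds.var(axis=0) + pred_var.mean(axis=0))
```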

predict_uncertainty_on_batch(X, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • masks (int) – the number of dropout masks to average over
Returns:

for each output, a tuple (y_pred, y_std) where y_pred is the predicted value of the output, and each element of y_std estimates the standard deviation of the corresponding element of y_pred

reload()[source]

Reload trained model from disk.

restore(checkpoint=None, model_dir=None, session=None)[source]

Reload the values of all variables from a checkpoint file.

Parameters:
  • checkpoint (str) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints.
  • model_dir (str, default None) – Directory to restore checkpoint from. If None, use self.model_dir.
  • session (tf.Session(), default None) – Session to run restore ops under. If None, self.session is used.
save()[source]

Dispatcher function for saving.

Each subclass is responsible for overriding this method.

save_checkpoint(max_checkpoints_to_keep=5, model_dir=None)[source]

Save a checkpoint to disk.

Usually you do not need to call this method, since fit() saves checkpoints automatically. If you have disabled automatic checkpointing during fitting, this can be called to manually write checkpoints.

Parameters:
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • model_dir (str, default None) – Model directory to save checkpoint to. If None, revert to self.model_dir
set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:**params (dict) – Estimator parameters.
Returns:self – Estimator instance.
Return type:object
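The `<component>__<parameter>` convention for nested estimators can be illustrated with a toy dictionary-based sketch (not DeepChem code): keys containing a double underscore are split and routed to the named sub-component.

```python
# Toy illustration of the <component>__<parameter> key convention.
def set_nested(params, updates):
    """Apply flat updates, routing 'component__param' keys to sub-dicts."""
    for key, value in updates.items():
        if "__" in key:
            component, sub = key.split("__", 1)
            params[component][sub] = value
        else:
            params[key] = value
    return params

cfg = {"batch_size": 50, "optimizer": {"learning_rate": 0.001}}
set_nested(cfg, {"optimizer__learning_rate": 0.01, "batch_size": 100})
# cfg is now {"batch_size": 100, "optimizer": {"learning_rate": 0.01}}
```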

RobustMultitaskClassifier

class deepchem.models.RobustMultitaskClassifier(n_tasks, n_features, layer_sizes=[1000], weight_init_stddevs=0.02, bias_init_consts=1.0, weight_decay_penalty=0.0, weight_decay_penalty_type='l2', dropouts=0.5, activation_fns=<function relu>, n_classes=2, bypass_layer_sizes=[100], bypass_weight_init_stddevs=[0.02], bypass_bias_init_consts=[1.0], bypass_dropouts=[0.5], **kwargs)[source]

Implements a neural network for robust multitasking.

The key idea of this model is to have bypass layers that feed directly from features to task output. This might provide some flexibility to route around challenges in multitasking with destructive interference.
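A minimal numpy sketch of the bypass idea, assuming simple linear layers (shapes and names are illustrative, not DeepChem internals): each task output combines a shared trunk with a per-task bypass path that sees the raw features directly, so interference in the shared trunk cannot fully cut off a task's access to the features.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_tasks, hidden = 8, 3, 16

x = rng.normal(size=(n_features,))
W_shared = rng.normal(size=(hidden, n_features))
W_bypass = rng.normal(size=(n_tasks, n_features))  # one bypass path per task
W_head = rng.normal(size=(n_tasks, hidden))

shared = np.maximum(W_shared @ x, 0.0)             # shared trunk (ReLU)
outputs = W_head @ shared + W_bypass @ x           # trunk + per-task bypass
```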

References

This technique was introduced in [1].

[1] Ramsundar, Bharath, et al. “Is multitask deep learning practical for pharma?” Journal of Chemical Information and Modeling 57.8 (2017): 2068-2076.
__init__(n_tasks, n_features, layer_sizes=[1000], weight_init_stddevs=0.02, bias_init_consts=1.0, weight_decay_penalty=0.0, weight_decay_penalty_type='l2', dropouts=0.5, activation_fns=<function relu>, n_classes=2, bypass_layer_sizes=[100], bypass_weight_init_stddevs=[0.02], bypass_bias_init_consts=[1.0], bypass_dropouts=[0.5], **kwargs)[source]

Create a RobustMultitaskClassifier.

Parameters:
  • n_tasks (int) – number of tasks
  • n_features (int) – number of features
  • layer_sizes (list) – the size of each dense layer in the network. The length of this list determines the number of layers.
  • weight_init_stddevs (list or float) – the standard deviation of the distribution to use for weight initialization of each layer. The length of this list should equal len(layer_sizes). Alternatively this may be a single value instead of a list, in which case the same value is used for every layer.
  • bias_init_consts (list or float) – the value to initialize the biases in each layer to. The length of this list should equal len(layer_sizes). Alternatively this may be a single value instead of a list, in which case the same value is used for every layer.
  • weight_decay_penalty (float) – the magnitude of the weight decay penalty to use
  • weight_decay_penalty_type (str) – the type of penalty to use for weight decay, either ‘l1’ or ‘l2’
  • dropouts (list or float) – the dropout probability to use for each layer. The length of this list should equal len(layer_sizes). Alternatively this may be a single value instead of a list, in which case the same value is used for every layer.
  • activation_fns (list or object) – the Tensorflow activation function to apply to each layer. The length of this list should equal len(layer_sizes). Alternatively this may be a single value instead of a list, in which case the same value is used for every layer.
  • n_classes (int) – the number of classes
  • bypass_layer_sizes (list) – the size of each dense layer in the bypass network. The length of this list determines the number of bypass layers.
  • bypass_weight_init_stddevs (list or float) – the standard deviation of the distribution to use for weight initialization of bypass layers. Same requirements as weight_init_stddevs.
  • bypass_bias_init_consts (list or float) – the value to initialize the biases in bypass layers to. Same requirements as bias_init_consts.
  • bypass_dropouts (list or float) – the dropout probability to use for bypass layers. Same requirements as dropouts.
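The "list or float" parameters above all follow the same broadcasting rule, which a small illustrative helper (not DeepChem's implementation) makes explicit: a scalar is expanded to one value per layer, while a list must already match len(layer_sizes).

```python
# Hypothetical helper showing the "list or float" parameter convention.
def per_layer(value, n_layers):
    """Broadcast a scalar to one value per layer, or validate a list."""
    if isinstance(value, (list, tuple)):
        assert len(value) == n_layers, "list length must equal len(layer_sizes)"
        return list(value)
    return [value] * n_layers

layer_sizes = [1000, 500]
dropouts = per_layer(0.5, len(layer_sizes))          # [0.5, 0.5]
stddevs = per_layer([0.02, 0.01], len(layer_sizes))  # kept as given
```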
compute_saliency(X)[source]

Compute the saliency map for an input sample.

This computes the Jacobian matrix with the derivative of each output element with respect to each input element. More precisely,

  • If this model has a single output, it returns a matrix of shape (output_shape, input_shape) with the derivatives.
  • If this model has multiple outputs, it returns a list of matrices, one for each output.

This method cannot be used on models that take multiple inputs.

Parameters:X (ndarray) – the input data for a single sample
Returns:
Return type:the Jacobian matrix, or a list of matrices
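To make the Jacobian concrete, here is a hedged sketch that approximates it by central finite differences for a toy two-output function, rather than by the autodiff machinery compute_saliency actually uses:

```python
import numpy as np

def jacobian_fd(f, x, eps=1e-6):
    """Finite-difference Jacobian: J[i, j] = d f_i / d x_j."""
    y = f(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

f = lambda x: np.array([x[0] * x[1], x[0] ** 2])
J = jacobian_fd(f, np.array([2.0, 3.0]))
# J ≈ [[3, 2], [4, 0]]
```

As in compute_saliency, a single output yields one matrix of shape (output_shape, input_shape); a model with multiple outputs would yield one such matrix per output.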
default_generator(dataset, epochs=1, mode='fit', deterministic=True, pad_batches=True)[source]

Create a generator that iterates batches for a dataset.

Subclasses may override this method to customize how model inputs are generated from the data.

Parameters:
  • dataset (Dataset) – the data to iterate
  • epochs (int) – the number of times to iterate over the full dataset
  • mode (str) – allowed values are ‘fit’ (called during training), ‘predict’ (called during prediction), and ‘uncertainty’ (called during uncertainty prediction)
  • deterministic (bool) – whether to iterate over the dataset in order, or randomly shuffle the data for each epoch
  • pad_batches (bool) – whether to pad each batch up to this model’s preferred batch size
Returns:

a generator that iterates batches, each represented as a tuple of lists ([inputs], [outputs], [weights])

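The batch shape default_generator yields can be sketched with a toy generator (illustrative only, not DeepChem's implementation): each batch is a tuple of three single-element lists.

```python
import numpy as np

def toy_generator(X, y, w, batch_size=2, epochs=1):
    """Yield ([inputs], [outputs], [weights]) batches, as default_generator does."""
    for _ in range(epochs):
        for i in range(0, len(X), batch_size):
            yield ([X[i:i + batch_size]], [y[i:i + batch_size]],
                   [w[i:i + batch_size]])

X = np.arange(6.0).reshape(6, 1)
y = np.zeros((6, 1))
w = np.ones((6, 1))
batches = list(toy_generator(X, y, w))
# 3 batches of 2 samples each
```

A generator with this shape is what fit_generator, predict_on_generator, and evaluate_generator consume.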
evaluate(dataset, metrics, transformers=[], per_task_metrics=False)[source]

Evaluates the performance of this model on specified dataset.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object.
  • metrics (list of deepchem.metrics.Metric) – Evaluation metrics
  • transformers (list) – List of deepchem.transformers.Transformer
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

evaluate_generator(generator, metrics, transformers=[], per_task_metrics=False)[source]

Evaluate the performance of this model on the data produced by a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • metrics (list of deepchem.metrics.Metric) – Evaluation metrics
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

fit(dataset, nb_epoch=10, max_checkpoints_to_keep=5, checkpoint_interval=1000, deterministic=False, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on a dataset.

Parameters:
  • dataset (Dataset) – the Dataset to train on
  • nb_epoch (int) – the number of epochs to train for
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • deterministic (bool) – if True, the samples are processed in order. If False, a different random order is used for each epoch.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
fit_generator(generator, max_checkpoints_to_keep=5, checkpoint_interval=1000, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on data from a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
Returns:

Return type:

the average loss over the most recent checkpoint interval

fit_on_batch(X, y, w, variables=None, loss=None, callbacks=[], checkpoint=True, max_checkpoints_to_keep=5)[source]

Perform a single step of training.

Parameters:
  • X (ndarray) – the inputs for the batch
  • y (ndarray) – the labels for the batch
  • w (ndarray) – the weights for the batch
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
  • checkpoint (bool) – if true, save a checkpoint after performing the training step
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
get_checkpoints(model_dir=None)[source]

Get a list of all available checkpoint files.

Parameters:model_dir (str, default None) – Directory to get list of checkpoints from. Reverts to self.model_dir if None
get_global_step()[source]

Get the number of steps of fitting that have been performed.

static get_model_filename(model_dir)[source]

Given model directory, obtain filename for the model itself.

get_num_tasks()[source]

Get number of tasks.

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any
static get_params_filename(model_dir)[source]

Given model directory, obtain filename for the model parameters.

get_task_type()[source]

Currently models can only be classifiers or regressors.

load_from_pretrained(source_model, assignment_map=None, value_map=None, checkpoint=None, model_dir=None, include_top=True, inputs=None, **kwargs)[source]

Copies variable values from a pretrained model. source_model can either be a pretrained model or a model with the same architecture. value_map is a variable-value dictionary. If no value_map is provided, the variable values are restored to the source_model from a checkpoint and a default value_map is created. assignment_map is a dictionary mapping variables from the source_model to the current model. If no assignment_map is provided, one is made from scratch and assumes the model is composed of several different layers, with the final one being a dense layer. include_top is used to control whether or not the final dense layer is used. The default assignment map is useful in cases where the type of task is different (classification vs regression) and/or the number of tasks differs between the pretrained model and the current one.

Parameters:
  • source_model (dc.KerasModel, required) – source_model can either be the pretrained model or a dc.KerasModel with the same architecture as the pretrained model. It is used to restore from a checkpoint, if value_map is None and to create a default assignment map if assignment_map is None
  • assignment_map (Dict, default None) – Dictionary mapping the source_model variables and current model variables
  • value_map (Dict, default None) – Dictionary containing source_model trainable variables mapped to numpy arrays. If value_map is None, the values are restored and a default variable map is created using the restored values
  • checkpoint (str, default None) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints
  • model_dir (str, default None) – Restore model from custom model directory if needed
  • include_top (bool, default True) – if True, copies the weights and bias associated with the final dense layer. Used only when assignment map is None
  • inputs (List, input tensors for model) – if not None, then the weights are built for both the source and self. This option is useful only for models that are built by subclassing tf.keras.Model, and not using the functional API by tf.keras
predict(dataset, transformers=[], outputs=None, output_types=None)[source]

Uses self to make predictions on provided Dataset object.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
  • output_types (list of Strings) – The output types to return. Will retrieve all outputs of these types from the model.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_embedding(dataset)[source]

Predicts embeddings created by underlying model if any exist. An embedding must be specified to have output_type of ‘embedding’ in the model definition.

Parameters:dataset (dc.data.Dataset) – Dataset to make prediction on
Returns:
a NumPy array of the embeddings the model produces, or a list of arrays if it produces multiple embeddings
predict_on_batch(X, transformers=[], outputs=None)[source]

Generates predictions for input samples, processing samples in a batch.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_on_generator(generator, transformers=[], outputs=None, output_types=None)[source]
Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned. If outputs is specified, output_types must be None.
  • output_types (String or list of Strings) – If specified, all outputs of this type will be retrieved from the model. If output_types is specified, outputs must be None.
Returns: a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs
predict_uncertainty(dataset, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • masks (int) – the number of dropout masks to average over
Returns:

for each output, a tuple (y_pred, y_std) where y_pred is the predicted value of the output, and each element of y_std estimates the standard deviation of the corresponding element of y_pred

predict_uncertainty_on_batch(X, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • masks (int) – the number of dropout masks to average over
Returns:

for each output, a tuple (y_pred, y_std) where y_pred is the predicted value of the output, and each element of y_std estimates the standard deviation of the corresponding element of y_pred

reload()[source]

Reload trained model from disk.

restore(checkpoint=None, model_dir=None, session=None)[source]

Reload the values of all variables from a checkpoint file.

Parameters:
  • checkpoint (str) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints.
  • model_dir (str, default None) – Directory to restore checkpoint from. If None, use self.model_dir.
  • session (tf.Session(), default None) – Session to run restore ops under. If None, self.session is used.
save()[source]

Dispatcher function for saving.

Each subclass is responsible for overriding this method.

save_checkpoint(max_checkpoints_to_keep=5, model_dir=None)[source]

Save a checkpoint to disk.

Usually you do not need to call this method, since fit() saves checkpoints automatically. If you have disabled automatic checkpointing during fitting, this can be called to manually write checkpoints.

Parameters:
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • model_dir (str, default None) – Model directory to save checkpoint to. If None, revert to self.model_dir
set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:**params (dict) – Estimator parameters.
Returns:self – Estimator instance.
Return type:object

RobustMultitaskRegressor

class deepchem.models.RobustMultitaskRegressor(n_tasks, n_features, layer_sizes=[1000], weight_init_stddevs=0.02, bias_init_consts=1.0, weight_decay_penalty=0.0, weight_decay_penalty_type='l2', dropouts=0.5, activation_fns=<function relu>, bypass_layer_sizes=[100], bypass_weight_init_stddevs=[0.02], bypass_bias_init_consts=[1.0], bypass_dropouts=[0.5], **kwargs)[source]

Implements a neural network for robust multitasking.

The key idea of this model is to have bypass layers that feed directly from features to task output. This might provide some flexibility to route around challenges in multitasking with destructive interference.

References

[1] Ramsundar, Bharath, et al. “Is multitask deep learning practical for pharma?” Journal of Chemical Information and Modeling 57.8 (2017): 2068-2076.
__init__(n_tasks, n_features, layer_sizes=[1000], weight_init_stddevs=0.02, bias_init_consts=1.0, weight_decay_penalty=0.0, weight_decay_penalty_type='l2', dropouts=0.5, activation_fns=<function relu>, bypass_layer_sizes=[100], bypass_weight_init_stddevs=[0.02], bypass_bias_init_consts=[1.0], bypass_dropouts=[0.5], **kwargs)[source]

Create a RobustMultitaskRegressor.

Parameters:
  • n_tasks (int) – number of tasks
  • n_features (int) – number of features
  • layer_sizes (list) – the size of each dense layer in the network. The length of this list determines the number of layers.
  • weight_init_stddevs (list or float) – the standard deviation of the distribution to use for weight initialization of each layer. The length of this list should equal len(layer_sizes). Alternatively this may be a single value instead of a list, in which case the same value is used for every layer.
  • bias_init_consts (list or float) – the value to initialize the biases in each layer to. The length of this list should equal len(layer_sizes). Alternatively this may be a single value instead of a list, in which case the same value is used for every layer.
  • weight_decay_penalty (float) – the magnitude of the weight decay penalty to use
  • weight_decay_penalty_type (str) – the type of penalty to use for weight decay, either ‘l1’ or ‘l2’
  • dropouts (list or float) – the dropout probability to use for each layer. The length of this list should equal len(layer_sizes). Alternatively this may be a single value instead of a list, in which case the same value is used for every layer.
  • activation_fns (list or object) – the Tensorflow activation function to apply to each layer. The length of this list should equal len(layer_sizes). Alternatively this may be a single value instead of a list, in which case the same value is used for every layer.
  • bypass_layer_sizes (list) – the size of each dense layer in the bypass network. The length of this list determines the number of bypass layers.
  • bypass_weight_init_stddevs (list or float) – the standard deviation of the distribution to use for weight initialization of bypass layers. Same requirements as weight_init_stddevs.
  • bypass_bias_init_consts (list or float) – the value to initialize the biases in bypass layers to. Same requirements as bias_init_consts.
  • bypass_dropouts (list or float) – the dropout probability to use for bypass layers. Same requirements as dropouts.
compute_saliency(X)[source]

Compute the saliency map for an input sample.

This computes the Jacobian matrix with the derivative of each output element with respect to each input element. More precisely,

  • If this model has a single output, it returns a matrix of shape (output_shape, input_shape) with the derivatives.
  • If this model has multiple outputs, it returns a list of matrices, one for each output.

This method cannot be used on models that take multiple inputs.

Parameters:X (ndarray) – the input data for a single sample
Returns:
Return type:the Jacobian matrix, or a list of matrices
default_generator(dataset, epochs=1, mode='fit', deterministic=True, pad_batches=True)[source]

Create a generator that iterates batches for a dataset.

Subclasses may override this method to customize how model inputs are generated from the data.

Parameters:
  • dataset (Dataset) – the data to iterate
  • epochs (int) – the number of times to iterate over the full dataset
  • mode (str) – allowed values are ‘fit’ (called during training), ‘predict’ (called during prediction), and ‘uncertainty’ (called during uncertainty prediction)
  • deterministic (bool) – whether to iterate over the dataset in order, or randomly shuffle the data for each epoch
  • pad_batches (bool) – whether to pad each batch up to this model’s preferred batch size
Returns:

a generator that iterates batches, each represented as a tuple of lists: ([inputs], [outputs], [weights])
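The batching contract above can be illustrated with a plain-Python sketch (`batch_generator` is a hypothetical helper, not the DeepChem source) that yields ([inputs], [labels], [weights]) tuples, optionally shuffling, and padding the final batch by cycling its samples:

```python
import itertools
import random

def batch_generator(X, y, w, batch_size=4, epochs=1,
                    deterministic=True, pad_batches=True):
    # Yields tuples of lists ([inputs], [labels], [weights]), mirroring
    # the generator contract described above.
    n = len(X)
    for _ in range(epochs):
        order = list(range(n))
        if not deterministic:
            random.shuffle(order)      # new random order each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            if pad_batches:
                # repeat samples cyclically until the batch is full
                idx = list(itertools.islice(itertools.cycle(idx), batch_size))
            yield ([[X[i] for i in idx]],
                   [[y[i] for i in idx]],
                   [[w[i] for i in idx]])
```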

evaluate(dataset, metrics, transformers=[], per_task_metrics=False)[source]

Evaluates the performance of this model on specified dataset.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object.
  • metrics (deepchem.metrics.Metric or list of Metric) – Evaluation metric(s)
  • transformers (list) – List of deepchem.transformers.Transformer
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

evaluate_generator(generator, metrics, transformers=[], per_task_metrics=False)[source]

Evaluate the performance of this model on the data produced by a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • metrics (deepchem.metrics.Metric or list of Metric) – Evaluation metric(s)
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

fit(dataset, nb_epoch=10, max_checkpoints_to_keep=5, checkpoint_interval=1000, deterministic=False, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on a dataset.

Parameters:
  • dataset (Dataset) – the Dataset to train on
  • nb_epoch (int) – the number of epochs to train for
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • deterministic (bool) – if True, the samples are processed in order. If False, a different random order is used for each epoch.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
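The callbacks argument follows a simple contract that can be sketched in plain Python. This illustrates only the call pattern, not DeepChem's actual training loop, and `run_training_steps` is a hypothetical helper:

```python
def run_training_steps(model, n_steps, callbacks=[]):
    # Each callback is a function f(model, step) invoked after every
    # training step; a single callable is accepted in place of a list.
    if callable(callbacks):
        callbacks = [callbacks]
    for step in range(1, n_steps + 1):
        # ... one gradient update would happen here ...
        for cb in callbacks:
            cb(model, step)

seen = []
run_training_steps(model=None, n_steps=3,
                   callbacks=lambda model, step: seen.append(step))
# seen is now [1, 2, 3]
```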
fit_generator(generator, max_checkpoints_to_keep=5, checkpoint_interval=1000, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on data from a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
Returns:

Return type:

the average loss over the most recent checkpoint interval

fit_on_batch(X, y, w, variables=None, loss=None, callbacks=[], checkpoint=True, max_checkpoints_to_keep=5)[source]

Perform a single step of training.

Parameters:
  • X (ndarray) – the inputs for the batch
  • y (ndarray) – the labels for the batch
  • w (ndarray) – the weights for the batch
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
  • checkpoint (bool) – if True, save a checkpoint after performing the training step
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
get_checkpoints(model_dir=None)[source]

Get a list of all available checkpoint files.

Parameters:model_dir (str, default None) – Directory to get list of checkpoints from. Reverts to self.model_dir if None
get_global_step()[source]

Get the number of steps of fitting that have been performed.

static get_model_filename(model_dir)[source]

Given model directory, obtain filename for the model itself.

get_num_tasks()[source]

Get number of tasks.

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any
static get_params_filename(model_dir)[source]

Given model directory, obtain filename for the model parameters.

get_task_type()[source]

Currently models can only be classifiers or regressors.

load_from_pretrained(source_model, assignment_map=None, value_map=None, checkpoint=None, model_dir=None, include_top=True, inputs=None, **kwargs)[source]

Copies variable values from a pretrained model. source_model can either be a pretrained model or a model with the same architecture. value_map is a variable-value dictionary. If no value_map is provided, the variable values are restored to the source_model from a checkpoint and a default value_map is created. assignment_map is a dictionary mapping variables from the source_model to the current model. If no assignment_map is provided, one is made from scratch and assumes the model is composed of several different layers, with the final one being a dense layer. include_top is used to control whether or not the final dense layer is used. The default assignment map is useful in cases where the type of task is different (classification vs regression) and/or number of tasks in the setting.

Parameters:
  • source_model (dc.KerasModel, required) – source_model can either be the pretrained model or a dc.KerasModel with the same architecture as the pretrained model. It is used to restore from a checkpoint, if value_map is None and to create a default assignment map if assignment_map is None
  • assignment_map (Dict, default None) – Dictionary mapping the source_model variables and current model variables
  • value_map (Dict, default None) – Dictionary containing source_model trainable variables mapped to numpy arrays. If value_map is None, the values are restored and a default variable map is created using the restored values
  • checkpoint (str, default None) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints
  • model_dir (str, default None) – Restore model from custom model directory if needed
  • include_top (bool, default True) – if True, copies the weights and bias associated with the final dense layer. Used only when assignment map is None
  • inputs (List, input tensors for model) – if not None, then the weights are built for both the source and self. This option is useful only for models that are built by subclassing tf.keras.Model, and not using the functional API by tf.keras
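The bookkeeping behind value_map and assignment_map can be sketched with plain dictionaries. `copy_pretrained_values` is a hypothetical stand-in for illustration only; DeepChem operates on tf.Variable objects rather than a name-to-value dict:

```python
def copy_pretrained_values(source_values, assignment_map=None,
                           include_top=True):
    # source_values: name -> value, a stand-in for value_map.
    # assignment_map: source name -> target name. When absent, build a
    # default identity map, optionally dropping the final entry as a
    # stand-in for skipping the top dense layer (include_top=False).
    names = list(source_values)
    if assignment_map is None:
        if not include_top:
            names = names[:-1]
        assignment_map = {name: name for name in names}
    return {target: source_values[source]
            for source, target in assignment_map.items()}
```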
predict(dataset, transformers=[], outputs=None, output_types=None)[source]

Uses self to make predictions on the provided Dataset object.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
  • output_types (list of Strings) – The output types to return. Will retrieve all outputs of these types from the model.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_embedding(dataset)[source]

Predicts embeddings created by the underlying model, if any exist. An embedding must have output_type ‘embedding’ in the model definition.

Parameters:dataset (dc.data.Dataset) – Dataset to make prediction on
Returns:
a NumPy array of the embeddings the model produces, or a list of arrays if it produces multiple embeddings
predict_on_batch(X, transformers=[], outputs=None)[source]

Generates predictions for input samples, processing samples in a batch.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_on_generator(generator, transformers=[], outputs=None, output_types=None)[source]
Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned. If outputs is specified, output_types must be None.
  • output_types (String or list of Strings) – If specified, all outputs of this type will be retrieved from the model. If output_types is specified, outputs must be None.
Returns: a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs
predict_uncertainty(dataset, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • masks (int) – the number of dropout masks to average over
Returns:

for each output, a tuple (y_pred, y_std), where y_pred is the predicted value of the output and each element of y_std estimates the standard deviation of the corresponding element of y_pred
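The averaging scheme described above can be sketched with NumPy: run a dropout-enabled forward pass `masks` times and report the per-element mean and standard deviation. Here `stochastic_predict` is a hypothetical stand-in for a model whose dropout stays active at inference; the full method in the paper additionally folds in the model's own aleatoric variance estimate:

```python
import numpy as np

def mc_dropout_uncertainty(stochastic_predict, X, masks=50):
    # Stack `masks` stochastic forward passes, then average: the mean is
    # the prediction, the spread across masks estimates the uncertainty.
    preds = np.stack([stochastic_predict(X) for _ in range(masks)])
    return preds.mean(axis=0), preds.std(axis=0)
```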

predict_uncertainty_on_batch(X, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • masks (int) – the number of dropout masks to average over
Returns:

for each output, a tuple (y_pred, y_std), where y_pred is the predicted value of the output and each element of y_std estimates the standard deviation of the corresponding element of y_pred

reload()[source]

Reload trained model from disk.

restore(checkpoint=None, model_dir=None, session=None)[source]

Reload the values of all variables from a checkpoint file.

Parameters:
  • checkpoint (str) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints.
  • model_dir (str, default None) – Directory to restore checkpoint from. If None, use self.model_dir.
  • session (tf.Session(), default None) – Session to run restore ops under. If None, self.session is used.
save()[source]

Dispatcher function for saving.

Each subclass is responsible for overriding this method.

save_checkpoint(max_checkpoints_to_keep=5, model_dir=None)[source]

Save a checkpoint to disk.

Usually you do not need to call this method, since fit() saves checkpoints automatically. If you have disabled automatic checkpointing during fitting, this can be called to manually write checkpoints.

Parameters:
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • model_dir (str, default None) – Model directory to save checkpoint to. If None, revert to self.model_dir
set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:**params (dict) – Estimator parameters.
Returns:self – Estimator instance.
Return type:object
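The <component>__<parameter> convention can be sketched in a few lines of plain Python. `set_nested_params` is a hypothetical helper that shows only the routing, not DeepChem's implementation:

```python
def set_nested_params(obj, **params):
    # A key like "component__parameter" is routed to the named sub-object;
    # any other key is set directly on obj.
    for key, value in params.items():
        if "__" in key:
            component, _, subkey = key.partition("__")
            setattr(getattr(obj, component), subkey, value)
        else:
            setattr(obj, key, value)
    return obj
```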

ProgressiveMultitaskClassifier

class deepchem.models.ProgressiveMultitaskClassifier(n_tasks, n_features, alpha_init_stddevs=0.02, layer_sizes=[1000], weight_init_stddevs=0.02, bias_init_consts=1.0, weight_decay_penalty=0.0, weight_decay_penalty_type='l2', dropouts=0.5, activation_fns=<function relu>, **kwargs)[source]

Implements a progressive multitask neural network for classification.

Progressive Networks: https://arxiv.org/pdf/1606.04671v3.pdf

Progressive networks allow for multitask learning where each task gets a new column of weights. As a result, there is no catastrophic forgetting where previous tasks are ignored.
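The column structure can be sketched schematically: each task owns a column of layers, and layer k of a later column also receives the layer-(k-1) activations of every earlier, frozen column through small adapter connections. In the toy sketch below, layers and adapters are plain callables and activations are numbers; this illustrates only the wiring, not DeepChem's implementation:

```python
def progressive_forward(columns, adapters, x, task):
    # columns[t][k]: layer k of task t's column.
    # adapters[(s, t, k)]: lateral connection feeding column s's previous
    # activation into layer k of column t (for s < t).
    acts = [x] * (task + 1)            # current activation of each column
    for k in range(len(columns[task])):
        new_acts = []
        for t in range(task + 1):
            h = columns[t][k](acts[t])
            for s in range(t):         # add adapted lateral inputs
                h = h + adapters[(s, t, k)](acts[s])
            new_acts.append(h)
        acts = new_acts
    return acts[task]
```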

__init__(n_tasks, n_features, alpha_init_stddevs=0.02, layer_sizes=[1000], weight_init_stddevs=0.02, bias_init_consts=1.0, weight_decay_penalty=0.0, weight_decay_penalty_type='l2', dropouts=0.5, activation_fns=<function relu>, **kwargs)[source]

Creates a progressive network.

Only listing parameters specific to progressive networks here.

Parameters:
  • n_tasks (int) – Number of tasks
  • n_features (int) – Number of input features
  • alpha_init_stddevs (list) – List of standard-deviations for alpha in adapter layers.
  • layer_sizes (list) – the size of each dense layer in the network. The length of this list determines the number of layers.
  • weight_init_stddevs (list or float) – the standard deviation of the distribution to use for weight initialization of each layer. The length of this list should equal len(layer_sizes)+1. The final element corresponds to the output layer. Alternatively this may be a single value instead of a list, in which case the same value is used for every layer.
  • bias_init_consts (list or float) – the value to initialize the biases in each layer to. The length of this list should equal len(layer_sizes)+1. The final element corresponds to the output layer. Alternatively this may be a single value instead of a list, in which case the same value is used for every layer.
  • weight_decay_penalty (float) – the magnitude of the weight decay penalty to use
  • weight_decay_penalty_type (str) – the type of penalty to use for weight decay, either ‘l1’ or ‘l2’
  • dropouts (list or float) – the dropout probability to use for each layer. The length of this list should equal len(layer_sizes). Alternatively this may be a single value instead of a list, in which case the same value is used for every layer.
  • activation_fns (list or object) – the Tensorflow activation function to apply to each layer. The length of this list should equal len(layer_sizes). Alternatively this may be a single value instead of a list, in which case the same value is used for every layer.
add_adapter(all_layers, task, layer_num)[source]

Add an adapter connection for a given task/layer combination.

compute_saliency(X)[source]

Compute the saliency map for an input sample.

This computes the Jacobian matrix with the derivative of each output element with respect to each input element. More precisely,

  • If this model has a single output, it returns a matrix of shape (output_shape, input_shape) with the derivatives.
  • If this model has multiple outputs, it returns a list of matrices, one for each output.

This method cannot be used on models that take multiple inputs.

Parameters:X (ndarray) – the input data for a single sample
Returns:
Return type:the Jacobian matrix, or a list of matrices
default_generator(dataset, epochs=1, mode='fit', deterministic=True, pad_batches=True)[source]

Create a generator that iterates batches for a dataset.

Subclasses may override this method to customize how model inputs are generated from the data.

Parameters:
  • dataset (Dataset) – the data to iterate
  • epochs (int) – the number of times to iterate over the full dataset
  • mode (str) – allowed values are ‘fit’ (called during training), ‘predict’ (called during prediction), and ‘uncertainty’ (called during uncertainty prediction)
  • deterministic (bool) – whether to iterate over the dataset in order, or randomly shuffle the data for each epoch
  • pad_batches (bool) – whether to pad each batch up to this model’s preferred batch size
Returns:

a generator that iterates batches, each represented as a tuple of lists: ([inputs], [outputs], [weights])

evaluate(dataset, metrics, transformers=[], per_task_metrics=False)[source]

Evaluates the performance of this model on specified dataset.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object.
  • metrics (deepchem.metrics.Metric or list of Metric) – Evaluation metric(s)
  • transformers (list) – List of deepchem.transformers.Transformer
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

evaluate_generator(generator, metrics, transformers=[], per_task_metrics=False)[source]

Evaluate the performance of this model on the data produced by a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • metrics (deepchem.metrics.Metric or list of Metric) – Evaluation metric(s)
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

fit(dataset, nb_epoch=10, max_checkpoints_to_keep=5, checkpoint_interval=1000, deterministic=False, restore=False, **kwargs)[source]

Train this model on a dataset.

Parameters:
  • dataset (Dataset) – the Dataset to train on
  • nb_epoch (int) – the number of epochs to train for
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • deterministic (bool) – if True, the samples are processed in order. If False, a different random order is used for each epoch.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
fit_generator(generator, max_checkpoints_to_keep=5, checkpoint_interval=1000, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on data from a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
Returns:

Return type:

the average loss over the most recent checkpoint interval

fit_on_batch(X, y, w, variables=None, loss=None, callbacks=[], checkpoint=True, max_checkpoints_to_keep=5)[source]

Perform a single step of training.

Parameters:
  • X (ndarray) – the inputs for the batch
  • y (ndarray) – the labels for the batch
  • w (ndarray) – the weights for the batch
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
  • checkpoint (bool) – if True, save a checkpoint after performing the training step
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
fit_task(dataset, task, nb_epoch=10, max_checkpoints_to_keep=5, checkpoint_interval=1000, deterministic=False, restore=False, **kwargs)[source]

Fit one task.

get_checkpoints(model_dir=None)[source]

Get a list of all available checkpoint files.

Parameters:model_dir (str, default None) – Directory to get list of checkpoints from. Reverts to self.model_dir if None
get_global_step()[source]

Get the number of steps of fitting that have been performed.

static get_model_filename(model_dir)[source]

Given model directory, obtain filename for the model itself.

get_num_tasks()[source]

Get number of tasks.

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any
static get_params_filename(model_dir)[source]

Given model directory, obtain filename for the model parameters.

get_task_type()[source]

Currently models can only be classifiers or regressors.

load_from_pretrained(source_model, assignment_map=None, value_map=None, checkpoint=None, model_dir=None, include_top=True, inputs=None, **kwargs)[source]

Copies variable values from a pretrained model. source_model can either be a pretrained model or a model with the same architecture. value_map is a variable-value dictionary. If no value_map is provided, the variable values are restored to the source_model from a checkpoint and a default value_map is created. assignment_map is a dictionary mapping variables from the source_model to the current model. If no assignment_map is provided, one is made from scratch and assumes the model is composed of several different layers, with the final one being a dense layer. include_top is used to control whether or not the final dense layer is used. The default assignment map is useful in cases where the type of task is different (classification vs regression) and/or number of tasks in the setting.

Parameters:
  • source_model (dc.KerasModel, required) – source_model can either be the pretrained model or a dc.KerasModel with the same architecture as the pretrained model. It is used to restore from a checkpoint, if value_map is None and to create a default assignment map if assignment_map is None
  • assignment_map (Dict, default None) – Dictionary mapping the source_model variables and current model variables
  • value_map (Dict, default None) – Dictionary containing source_model trainable variables mapped to numpy arrays. If value_map is None, the values are restored and a default variable map is created using the restored values
  • checkpoint (str, default None) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints
  • model_dir (str, default None) – Restore model from custom model directory if needed
  • include_top (bool, default True) – if True, copies the weights and bias associated with the final dense layer. Used only when assignment map is None
  • inputs (List, input tensors for model) – if not None, then the weights are built for both the source and self. This option is useful only for models that are built by subclassing tf.keras.Model, and not using the functional API by tf.keras
predict(dataset, transformers=[], outputs=None, output_types=None)[source]

Uses self to make predictions on the provided Dataset object.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
  • output_types (list of Strings) – The output types to return. Will retrieve all outputs of these types from the model.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_embedding(dataset)[source]

Predicts embeddings created by the underlying model, if any exist. An embedding must have output_type ‘embedding’ in the model definition.

Parameters:dataset (dc.data.Dataset) – Dataset to make prediction on
Returns:
a NumPy array of the embeddings the model produces, or a list of arrays if it produces multiple embeddings
predict_on_batch(X, transformers=[], outputs=None)[source]

Generates predictions for input samples, processing samples in a batch.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_on_generator(generator, transformers=[], outputs=None, output_types=None)[source]
Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned. If outputs is specified, output_types must be None.
  • output_types (String or list of Strings) – If specified, all outputs of this type will be retrieved from the model. If output_types is specified, outputs must be None.
Returns: a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs
predict_uncertainty(dataset, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • masks (int) – the number of dropout masks to average over
Returns:

for each output, a tuple (y_pred, y_std), where y_pred is the predicted value of the output and each element of y_std estimates the standard deviation of the corresponding element of y_pred

predict_uncertainty_on_batch(X, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • masks (int) – the number of dropout masks to average over
Returns:

for each output, a tuple (y_pred, y_std) where y_pred is the predicted value of the output, and each element of y_std estimates the standard deviation of the corresponding element of y_pred

reload()[source]

Reload trained model from disk.

restore(checkpoint=None, model_dir=None, session=None)[source]

Reload the values of all variables from a checkpoint file.

Parameters:
  • checkpoint (str) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints.
  • model_dir (str, default None) – Directory to restore checkpoint from. If None, use self.model_dir.
  • session (tf.Session(), default None) – Session to run restore ops under. If None, self.session is used.
save()[source]

Dispatcher function for saving.

Each subclass is responsible for overriding this method.

save_checkpoint(max_checkpoints_to_keep=5, model_dir=None)[source]

Save a checkpoint to disk.

Usually you do not need to call this method, since fit() saves checkpoints automatically. If you have disabled automatic checkpointing during fitting, this can be called to manually write checkpoints.

Parameters:
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • model_dir (str, default None) – Model directory to save checkpoint to. If None, revert to self.model_dir
set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:**params (dict) – Estimator parameters.
Returns:self – Estimator instance.
Return type:object
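The `<component>__<parameter>` routing described above can be illustrated with a minimal sketch. The `Estimator` class here is a hypothetical stand-in for the scikit-learn-style behavior, not a DeepChem class:

```python
# Minimal sketch of sklearn-style nested parameter names: a key like
# "model__alpha" is split on "__" and routed to the sub-object "model".
class Estimator:
    def __init__(self, **params):
        self.__dict__.update(params)

    def set_params(self, **params):
        for key, value in params.items():
            if "__" in key:                    # nested: route to a component
                component, _, sub_key = key.partition("__")
                getattr(self, component).set_params(**{sub_key: value})
            else:
                setattr(self, key, value)
        return self                            # estimator instance, as documented

inner = Estimator(alpha=0.1)
outer = Estimator(lr=0.01, model=inner)
outer.set_params(lr=0.001, model__alpha=0.5)
assert outer.lr == 0.001 and inner.alpha == 0.5
```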

ProgressiveMultitaskRegressor

class deepchem.models.ProgressiveMultitaskRegressor(n_tasks, n_features, alpha_init_stddevs=0.02, layer_sizes=[1000], weight_init_stddevs=0.02, bias_init_consts=1.0, weight_decay_penalty=0.0, weight_decay_penalty_type='l2', dropouts=0.5, activation_fns=<function relu>, n_outputs=1, **kwargs)[source]

Implements a progressive multitask neural network for regression.

Progressive networks allow for multitask learning where each task gets a new column of weights. As a result, there is no exponential forgetting where previous tasks are ignored.

References

See [1] for a full description of the progressive architecture.

[1] Rusu, Andrei A., et al. “Progressive neural networks.” arXiv preprint arXiv:1606.04671 (2016).
__init__(n_tasks, n_features, alpha_init_stddevs=0.02, layer_sizes=[1000], weight_init_stddevs=0.02, bias_init_consts=1.0, weight_decay_penalty=0.0, weight_decay_penalty_type='l2', dropouts=0.5, activation_fns=<function relu>, n_outputs=1, **kwargs)[source]

Creates a progressive network.

Only listing parameters specific to progressive networks here.

Parameters:
  • n_tasks (int) – Number of tasks
  • n_features (int) – Number of input features
  • alpha_init_stddevs (list) – List of standard-deviations for alpha in adapter layers.
  • layer_sizes (list) – the size of each dense layer in the network. The length of this list determines the number of layers.
  • weight_init_stddevs (list or float) – the standard deviation of the distribution to use for weight initialization of each layer. The length of this list should equal len(layer_sizes)+1. The final element corresponds to the output layer. Alternatively this may be a single value instead of a list, in which case the same value is used for every layer.
  • bias_init_consts (list or float) – the value to initialize the biases in each layer to. The length of this list should equal len(layer_sizes)+1. The final element corresponds to the output layer. Alternatively this may be a single value instead of a list, in which case the same value is used for every layer.
  • weight_decay_penalty (float) – the magnitude of the weight decay penalty to use
  • weight_decay_penalty_type (str) – the type of penalty to use for weight decay, either ‘l1’ or ‘l2’
  • dropouts (list or float) – the dropout probability to use for each layer. The length of this list should equal len(layer_sizes). Alternatively this may be a single value instead of a list, in which case the same value is used for every layer.
  • activation_fns (list or object) – the Tensorflow activation function to apply to each layer. The length of this list should equal len(layer_sizes). Alternatively this may be a single value instead of a list, in which case the same value is used for every layer.
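The core idea behind the progressive architecture, that each task gets a fresh column of weights which reads earlier columns through adapter connections, can be sketched in NumPy. This is an illustrative toy, not DeepChem's implementation; the weight names are made up:

```python
import numpy as np

# Toy sketch of progressive columns: task 2's column reads task 1's frozen
# hidden activations through an adapter, so training task 2 never
# overwrites task 1's weights (no catastrophic forgetting).
rng = np.random.default_rng(0)
n_features, hidden = 4, 3

def relu(x):
    return np.maximum(x, 0)

W1 = rng.normal(size=(n_features, hidden))   # column for task 1 (frozen)
W2 = rng.normal(size=(n_features, hidden))   # new column for task 2
U12 = rng.normal(size=(hidden, hidden))      # adapter: column 1 -> column 2

x = rng.normal(size=(1, n_features))
h1 = relu(x @ W1)                            # task 1's hidden activations
h2 = relu(x @ W2 + h1 @ U12)                 # task 2 also sees h1
assert h2.shape == (1, hidden)
```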
add_adapter(all_layers, task, layer_num)[source]

Add an adapter connection for given task/layer combo

compute_saliency(X)[source]

Compute the saliency map for an input sample.

This computes the Jacobian matrix with the derivative of each output element with respect to each input element. More precisely,

  • If this model has a single output, it returns a matrix of shape (output_shape, input_shape) with the derivatives.
  • If this model has multiple outputs, it returns a list of matrices, one for each output.

This method cannot be used on models that take multiple inputs.

Parameters:X (ndarray) – the input data for a single sample
Returns:
Return type:the Jacobian matrix, or a list of matrices
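The shape contract of the returned Jacobian can be checked against a finite-difference stand-in. The real method uses automatic differentiation; this hypothetical toy only illustrates the (output_shape, input_shape) layout:

```python
import numpy as np

# Finite-difference sketch of a saliency map: J[i, j] holds the derivative
# of output element i with respect to input element j, for one sample.
def jacobian_fd(f, x, eps=1e-6):
    y = f(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (f(x + dx) - y) / eps      # forward difference, column j
    return J

f = lambda x: np.array([x[0] * x[1], x[1] ** 2])   # toy 2-output "model"
J = jacobian_fd(f, np.array([2.0, 3.0]))
assert J.shape == (2, 2)                     # (output_shape, input_shape)
assert np.allclose(J, [[3.0, 2.0], [0.0, 6.0]], atol=1e-4)
```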
default_generator(dataset, epochs=1, mode='fit', deterministic=True, pad_batches=True)[source]

Create a generator that iterates batches for a dataset.

Subclasses may override this method to customize how model inputs are generated from the data.

Parameters:
  • dataset (Dataset) – the data to iterate
  • epochs (int) – the number of times to iterate over the full dataset
  • mode (str) – allowed values are ‘fit’ (called during training), ‘predict’ (called during prediction), and ‘uncertainty’ (called during uncertainty prediction)
  • deterministic (bool) – whether to iterate over the dataset in order, or randomly shuffle the data for each epoch
  • pad_batches (bool) – whether to pad each batch up to this model’s preferred batch size
Returns:

a generator that iterates batches, each represented as a tuple of lists ([inputs], [outputs], [weights])
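A hypothetical stand-in for such a generator, showing the ([inputs], [labels], [weights]) tuple shape that fit_generator and predict_on_generator consume:

```python
import numpy as np

# Minimal batch generator sketch: each yielded item is a tuple of three
# lists (inputs, labels, weights), with one array per input/output slot.
def make_generator(X, y, w, batch_size=2, epochs=1):
    n = len(X)
    for _ in range(epochs):
        for start in range(0, n, batch_size):
            s = slice(start, start + batch_size)
            yield ([X[s]], [y[s]], [w[s]])

X = np.arange(8.0).reshape(4, 2)
y = np.ones((4, 1))
w = np.ones((4, 1))
batches = list(make_generator(X, y, w))
assert len(batches) == 2
inputs, labels, weights = batches[0]
assert inputs[0].shape == (2, 2)
```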

evaluate(dataset, metrics, transformers=[], per_task_metrics=False)[source]

Evaluates the performance of this model on specified dataset.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object.
  • metrics (deepchem.metrics.Metric) – Evaluation metric
  • transformers (list) – List of deepchem.transformers.Transformer
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

evaluate_generator(generator, metrics, transformers=[], per_task_metrics=False)[source]

Evaluate the performance of this model on the data produced by a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • metrics (deepchem.metrics.Metric) – Evaluation metric
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

fit(dataset, nb_epoch=10, max_checkpoints_to_keep=5, checkpoint_interval=1000, deterministic=False, restore=False, **kwargs)[source]

Train this model on a dataset.

Parameters:
  • dataset (Dataset) – the Dataset to train on
  • nb_epoch (int) – the number of epochs to train for
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • deterministic (bool) – if True, the samples are processed in order. If False, a different random order is used for each epoch.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
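The callbacks parameter takes functions of the form f(model, step). A sketch with a stub training loop (not DeepChem's) shows how such a callback is invoked after every step; `StepLogger` and `toy_fit` are made-up names for illustration:

```python
# Hedged sketch of the f(model, step) callback contract described above.
class StepLogger:
    def __init__(self):
        self.steps = []

    def __call__(self, model, step):
        self.steps.append(step)              # e.g. log loss, run validation

def toy_fit(model, n_steps, callbacks):
    for step in range(1, n_steps + 1):
        # ... one gradient step would happen here ...
        for cb in callbacks:
            cb(model, step)                  # invoked after every step

logger = StepLogger()
toy_fit(model=None, n_steps=3, callbacks=[logger])
assert logger.steps == [1, 2, 3]
```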
fit_generator(generator, max_checkpoints_to_keep=5, checkpoint_interval=1000, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on data from a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
Returns:

the average loss over the most recent checkpoint interval

fit_on_batch(X, y, w, variables=None, loss=None, callbacks=[], checkpoint=True, max_checkpoints_to_keep=5)[source]

Perform a single step of training.

Parameters:
  • X (ndarray) – the inputs for the batch
  • y (ndarray) – the labels for the batch
  • w (ndarray) – the weights for the batch
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
  • checkpoint (bool) – if true, save a checkpoint after performing the training step
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
fit_task(dataset, task, nb_epoch=10, max_checkpoints_to_keep=5, checkpoint_interval=1000, deterministic=False, restore=False, **kwargs)[source]

Fit one task.

get_checkpoints(model_dir=None)[source]

Get a list of all available checkpoint files.

Parameters:model_dir (str, default None) – Directory to get list of checkpoints from. Reverts to self.model_dir if None
get_global_step()[source]

Get the number of steps of fitting that have been performed.

static get_model_filename(model_dir)[source]

Given model directory, obtain filename for the model itself.

get_num_tasks()[source]

Get number of tasks.

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any
static get_params_filename(model_dir)[source]

Given model directory, obtain filename for the stored model parameters.

get_task_type()[source]

Currently models can only be classifiers or regressors.

load_from_pretrained(source_model, assignment_map=None, value_map=None, checkpoint=None, model_dir=None, include_top=True, inputs=None, **kwargs)[source]

Copies variable values from a pretrained model. source_model can either be the pretrained model itself or a model with the same architecture. value_map is a variable-value dictionary; if none is provided, the variable values are restored to source_model from a checkpoint and a default value_map is created. assignment_map is a dictionary mapping variables from source_model to the current model; if none is provided, one is built from scratch under the assumption that the model is composed of several layers, with the final one being a dense layer. include_top controls whether the final dense layer is copied. The default assignment map is useful when the type of task differs (classification vs. regression) and/or the number of tasks differs between the two models.

Parameters:
  • source_model (dc.KerasModel, required) – source_model can either be the pretrained model or a dc.KerasModel with the same architecture as the pretrained model. It is used to restore from a checkpoint, if value_map is None and to create a default assignment map if assignment_map is None
  • assignment_map (Dict, default None) – Dictionary mapping the source_model variables and current model variables
  • value_map (Dict, default None) – Dictionary containing source_model trainable variables mapped to numpy arrays. If value_map is None, the values are restored and a default variable map is created using the restored values
  • checkpoint (str, default None) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints
  • model_dir (str, default None) – Restore model from custom model directory if needed
  • include_top (bool, default True) – if True, copies the weights and bias associated with the final dense layer. Used only when assignment map is None
  • inputs (List, input tensors for model) – if not None, then the weights are built for both the source and self. This option is useful only for models that are built by subclassing tf.keras.Model, and not using the functional API by tf.keras
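The default include_top behavior can be sketched with plain dicts standing in for the model's variables. This is an illustrative assumption-laden toy (variable names and the "final dense layer = last kernel and bias" convention are made up), not DeepChem's implementation:

```python
import numpy as np

# Sketch of the default copy described above: copy variable values layer
# by layer, optionally skipping the final dense layer (include_top=False).
def copy_pretrained(source_vars, target_vars, include_top=True):
    names = list(source_vars)
    if not include_top:
        names = names[:-2]                   # assume final dense = kernel+bias
    for name in names:
        target_vars[name] = source_vars[name].copy()
    return target_vars

source = {"dense1/kernel": np.ones((4, 8)), "dense1/bias": np.zeros(8),
          "top/kernel": np.ones((8, 2)), "top/bias": np.zeros(2)}
target = {k: np.zeros_like(v) for k, v in source.items()}
copy_pretrained(source, target, include_top=False)
assert np.all(target["dense1/kernel"] == 1)  # copied
assert np.all(target["top/kernel"] == 0)     # final layer left untouched
```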
predict(dataset, transformers=[], outputs=None, output_types=None)[source]

Uses self to make predictions on provided Dataset object.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
  • output_types (list of Strings) – The output types to return. Will retrieve all outputs of these types from the model.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_embedding(dataset)[source]

Predicts embeddings created by underlying model if any exist. An embedding must be specified to have output_type of ‘embedding’ in the model definition.

Parameters:dataset (dc.data.Dataset) – Dataset to make prediction on
Returns:
a NumPy array of the embeddings the model produces, or a list of arrays if it produces multiple embeddings
predict_on_batch(X, transformers=[], outputs=None)[source]

Generates predictions for input samples, processing samples in a batch.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_on_generator(generator, transformers=[], outputs=None, output_types=None)[source]
Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned. If outputs is specified, output_types must be None.
  • output_types (String or list of Strings) – If specified, all outputs of this type will be retrieved from the model. If output_types is specified, outputs must be None.
  • Returns – a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs
predict_uncertainty(dataset, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • masks (int) – the number of dropout masks to average over
Returns:

for each output, a tuple (y_pred, y_std) where y_pred is the predicted value of the output, and each element of y_std estimates the standard deviation of the corresponding element of y_pred

predict_uncertainty_on_batch(X, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • masks (int) – the number of dropout masks to average over
Returns:

for each output, a tuple (y_pred, y_std) where y_pred is the predicted value of the output, and each element of y_std estimates the standard deviation of the corresponding element of y_pred

reload()[source]

Reload trained model from disk.

restore(checkpoint=None, model_dir=None, session=None)[source]

Reload the values of all variables from a checkpoint file.

Parameters:
  • checkpoint (str) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints.
  • model_dir (str, default None) – Directory to restore checkpoint from. If None, use self.model_dir.
  • session (tf.Session(), default None) – Session to run restore ops under. If None, self.session is used.
save()[source]

Dispatcher function for saving.

Each subclass is responsible for overriding this method.

save_checkpoint(max_checkpoints_to_keep=5, model_dir=None)[source]

Save a checkpoint to disk.

Usually you do not need to call this method, since fit() saves checkpoints automatically. If you have disabled automatic checkpointing during fitting, this can be called to manually write checkpoints.

Parameters:
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • model_dir (str, default None) – Model directory to save checkpoint to. If None, revert to self.model_dir
set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:**params (dict) – Estimator parameters.
Returns:self – Estimator instance.
Return type:object

WeaveModel

class deepchem.models.WeaveModel(n_tasks, n_atom_feat=75, n_pair_feat=14, n_hidden=50, n_graph_feat=128, mode='classification', n_classes=2, batch_size=100, **kwargs)[source]

Implements Google-style Weave Graph Convolutions

This model implements the Weave style graph convolutions from the following paper.

Kearnes, Steven, et al. “Molecular graph convolutions: moving beyond fingerprints.” Journal of computer-aided molecular design 30.8 (2016): 595-608.

The biggest difference between WeaveModel style convolutions and GraphConvModel style convolutions is that Weave convolutions model bond features explicitly. As a side effect, they need to construct an NxN matrix explicitly to model bond interactions. This may cause scaling issues, but may also allow for better modeling of subtle bond effects.

__init__(n_tasks, n_atom_feat=75, n_pair_feat=14, n_hidden=50, n_graph_feat=128, mode='classification', n_classes=2, batch_size=100, **kwargs)[source]
Parameters:
  • n_tasks (int) – Number of tasks
  • n_atom_feat (int, optional) – Number of features per atom.
  • n_pair_feat (int, optional) – Number of features per pair of atoms.
  • n_hidden (int, optional) – Number of units (convolution depth) in the corresponding hidden layer
  • n_graph_feat (int, optional) – Number of output features for each molecule (graph)
  • mode (str) – Either “classification” or “regression” for type of model.
  • n_classes (int) – Number of classes to predict (only used in classification mode)
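The NxN scaling mentioned above comes from keeping a feature vector for every pair of atoms. A small NumPy sketch (random values stand in for real bond descriptors; not DeepChem's implementation) makes the cost concrete:

```python
import numpy as np

# Weave-style inputs: per-atom features plus per-pair features, so the
# pair tensor has shape (N, N, n_pair_feat) and grows quadratically in N.
rng = np.random.default_rng(0)
n_atoms, n_atom_feat, n_pair_feat = 5, 75, 14    # defaults from __init__

atom_feats = rng.normal(size=(n_atoms, n_atom_feat))
pair_feats = rng.normal(size=(n_atoms, n_atoms, n_pair_feat))

# One Weave-style step aggregates pair features per atom:
pair_summary = pair_feats.sum(axis=1)            # (N, n_pair_feat)
assert pair_summary.shape == (n_atoms, n_pair_feat)
assert pair_feats.size == n_atoms ** 2 * n_pair_feat   # the NxN cost
```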
compute_saliency(X)[source]

Compute the saliency map for an input sample.

This computes the Jacobian matrix with the derivative of each output element with respect to each input element. More precisely,

  • If this model has a single output, it returns a matrix of shape (output_shape, input_shape) with the derivatives.
  • If this model has multiple outputs, it returns a list of matrices, one for each output.

This method cannot be used on models that take multiple inputs.

Parameters:X (ndarray) – the input data for a single sample
Returns:
Return type:the Jacobian matrix, or a list of matrices
default_generator(dataset, epochs=1, mode='fit', deterministic=True, pad_batches=True)[source]

Create a generator that iterates batches for a dataset.

Subclasses may override this method to customize how model inputs are generated from the data.

Parameters:
  • dataset (Dataset) – the data to iterate
  • epochs (int) – the number of times to iterate over the full dataset
  • mode (str) – allowed values are ‘fit’ (called during training), ‘predict’ (called during prediction), and ‘uncertainty’ (called during uncertainty prediction)
  • deterministic (bool) – whether to iterate over the dataset in order, or randomly shuffle the data for each epoch
  • pad_batches (bool) – whether to pad each batch up to this model’s preferred batch size
Returns:

a generator that iterates batches, each represented as a tuple of lists ([inputs], [outputs], [weights])

evaluate(dataset, metrics, transformers=[], per_task_metrics=False)[source]

Evaluates the performance of this model on specified dataset.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object.
  • metrics (deepchem.metrics.Metric) – Evaluation metric
  • transformers (list) – List of deepchem.transformers.Transformer
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

evaluate_generator(generator, metrics, transformers=[], per_task_metrics=False)[source]

Evaluate the performance of this model on the data produced by a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • metrics (deepchem.metrics.Metric) – Evaluation metric
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

fit(dataset, nb_epoch=10, max_checkpoints_to_keep=5, checkpoint_interval=1000, deterministic=False, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on a dataset.

Parameters:
  • dataset (Dataset) – the Dataset to train on
  • nb_epoch (int) – the number of epochs to train for
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • deterministic (bool) – if True, the samples are processed in order. If False, a different random order is used for each epoch.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
fit_generator(generator, max_checkpoints_to_keep=5, checkpoint_interval=1000, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on data from a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
Returns:

the average loss over the most recent checkpoint interval

fit_on_batch(X, y, w, variables=None, loss=None, callbacks=[], checkpoint=True, max_checkpoints_to_keep=5)[source]

Perform a single step of training.

Parameters:
  • X (ndarray) – the inputs for the batch
  • y (ndarray) – the labels for the batch
  • w (ndarray) – the weights for the batch
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
  • checkpoint (bool) – if true, save a checkpoint after performing the training step
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
get_checkpoints(model_dir=None)[source]

Get a list of all available checkpoint files.

Parameters:model_dir (str, default None) – Directory to get list of checkpoints from. Reverts to self.model_dir if None
get_global_step()[source]

Get the number of steps of fitting that have been performed.

static get_model_filename(model_dir)[source]

Given model directory, obtain filename for the model itself.

get_num_tasks()[source]

Get number of tasks.

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any
static get_params_filename(model_dir)[source]

Given model directory, obtain filename for the stored model parameters.

get_task_type()[source]

Currently models can only be classifiers or regressors.

load_from_pretrained(source_model, assignment_map=None, value_map=None, checkpoint=None, model_dir=None, include_top=True, inputs=None, **kwargs)[source]

Copies variable values from a pretrained model. source_model can either be the pretrained model itself or a model with the same architecture. value_map is a variable-value dictionary; if none is provided, the variable values are restored to source_model from a checkpoint and a default value_map is created. assignment_map is a dictionary mapping variables from source_model to the current model; if none is provided, one is built from scratch under the assumption that the model is composed of several layers, with the final one being a dense layer. include_top controls whether the final dense layer is copied. The default assignment map is useful when the type of task differs (classification vs. regression) and/or the number of tasks differs between the two models.

Parameters:
  • source_model (dc.KerasModel, required) – source_model can either be the pretrained model or a dc.KerasModel with the same architecture as the pretrained model. It is used to restore from a checkpoint if value_map is None, and to create a default assignment map if assignment_map is None.
  • assignment_map (Dict, default None) – Dictionary mapping the source_model variables to current model variables.
  • value_map (Dict, default None) – Dictionary mapping source_model trainable variables to numpy arrays. If value_map is None, the values are restored from a checkpoint and a default value map is created using the restored values.
  • checkpoint (str, default None) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints.
  • model_dir (str, default None) – Restore model from custom model directory if needed.
  • include_top (bool, default True) – if True, copies the weights and bias associated with the final dense layer. Used only when assignment_map is None.
  • inputs (List, input tensors for model) – if not None, the weights are built for both the source model and self. This option is useful only for models built by subclassing tf.keras.Model, not via the tf.keras functional API.
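The interplay of value_map, assignment_map, and include_top can be sketched with plain dictionaries (the variable names and the final-layer name `dense_final` are hypothetical; the real method operates on tf.Variable objects):

```python
def build_assignment_map(value_map, include_top=True):
    """Sketch of the default assignment map: copy every layer's weights,
    optionally skipping the final dense layer (hypothetical name 'dense_final')."""
    assignment_map = {}
    for var_name in value_map:
        if not include_top and var_name.startswith("dense_final"):
            continue  # leave the top layer to be retrained from scratch
        assignment_map[var_name] = var_name  # source variable -> target variable
    return assignment_map

# Hypothetical pretrained weights keyed by variable name
value_map = {"conv_1/kernel": [...], "conv_1/bias": [...],
             "dense_final/kernel": [...], "dense_final/bias": [...]}

full = build_assignment_map(value_map, include_top=True)
headless = build_assignment_map(value_map, include_top=False)
```

With include_top=False only the lower layers are mapped, which matches the transfer-learning case where the number of tasks or the task type differs.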
predict(dataset, transformers=[], outputs=None, output_types=None)[source]

Uses self to make predictions on provided Dataset object.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
  • output_types (list of Strings) – The output types to return. Will retrieve all outputs of these types from the model.
Returns:

  • a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_embedding(dataset)[source]

Predicts embeddings created by the underlying model, if any exist. An embedding must be specified to have output_type of ‘embedding’ in the model definition.

Parameters:dataset (dc.data.Dataset) – Dataset to make prediction on
Returns:
  • a NumPy array of the embeddings the model produces, or a list of arrays if it produces multiple embeddings
predict_on_batch(X, transformers=[], outputs=None)[source]

Generates predictions for input samples, processing samples in a batch.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
Returns:

  • a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_on_generator(generator, transformers=[], outputs=None, output_types=None)[source]
Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned. If outputs is specified, output_types must be None.
  • output_types (String or list of Strings) – If specified, all outputs of this type will be retrieved from the model. If output_types is specified, outputs must be None.
  • Returns – a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs
predict_uncertainty(dataset, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • masks (int) – the number of dropout masks to average over
Returns:

  • for each output, a tuple (y_pred, y_std) where y_pred is the predicted value of the output, and each element of y_std estimates the standard deviation of the corresponding element of y_pred
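The epistemic part of this estimate can be sketched with NumPy: repeat a stochastic forward pass under different dropout masks, then report the mean as y_pred and the spread as (part of) y_std. The predict_with_dropout function below is a stand-in for a real model's forward pass, not DeepChem code:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_with_dropout(X, p=0.5):
    """Stand-in for one stochastic forward pass: a fixed linear model
    with a fresh random dropout mask applied to the input features."""
    w = np.array([1.0, 2.0, -0.5])
    mask = rng.random(X.shape) > p
    return (X * mask / (1 - p)) @ w  # inverted-dropout scaling

X = np.ones((4, 3))  # 4 samples, 3 features
preds = np.stack([predict_with_dropout(X) for _ in range(50)])  # (masks, samples)

y_pred = preds.mean(axis=0)  # average over all dropout masks
y_std = preds.std(axis=0)    # variation among predictions = epistemic part
```

The full y_std reported by the real method also folds in the model's own aleatoric estimate, which this sketch omits.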

predict_uncertainty_on_batch(X, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • masks (int) – the number of dropout masks to average over
Returns:

  • for each output, a tuple (y_pred, y_std) where y_pred is the predicted value of the output, and each element of y_std estimates the standard deviation of the corresponding element of y_pred

reload()[source]

Reload trained model from disk.

restore(checkpoint=None, model_dir=None, session=None)[source]

Reload the values of all variables from a checkpoint file.

Parameters:
  • checkpoint (str) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints.
  • model_dir (str, default None) – Directory to restore checkpoint from. If None, use self.model_dir.
  • session (tf.Session(), default None) – Session to run restore ops under. If None, self.session is used.
save()[source]

Dispatcher function for saving.

Each subclass is responsible for overriding this method.

save_checkpoint(max_checkpoints_to_keep=5, model_dir=None)[source]

Save a checkpoint to disk.

Usually you do not need to call this method, since fit() saves checkpoints automatically. If you have disabled automatic checkpointing during fitting, this can be called to manually write checkpoints.

Parameters:
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • model_dir (str, default None) – Model directory to save checkpoint to. If None, revert to self.model_dir
set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:**params (dict) – Estimator parameters.
Returns:self – Estimator instance.
Return type:object
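The `<component>__<parameter>` convention can be sketched as a plain string split (a simplified stand-in for the scikit-learn-style behavior this method follows; the parameter names are hypothetical):

```python
def split_nested_param(key):
    """Split 'component__parameter' into its parts; a plain key applies
    to the estimator itself (component is None)."""
    if "__" in key:
        component, param = key.split("__", 1)
        return component, param
    return None, key

nested = split_nested_param("optimizer__learning_rate")  # goes to sub-object
plain = split_nested_param("batch_size")                 # goes to the estimator
```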

DTNNModel

class deepchem.models.DTNNModel(n_tasks, n_embedding=30, n_hidden=100, n_distance=100, distance_min=-1, distance_max=18, output_activation=True, mode='regression', dropout=0.0, **kwargs)[source]

Deep Tensor Neural Networks

This class implements deep tensor neural networks as first defined in

Schütt, Kristof T., et al. “Quantum-chemical insights from deep tensor neural networks.” Nature communications 8.1 (2017): 1-8.

__init__(n_tasks, n_embedding=30, n_hidden=100, n_distance=100, distance_min=-1, distance_max=18, output_activation=True, mode='regression', dropout=0.0, **kwargs)[source]
Parameters:
  • n_tasks (int) – Number of tasks
  • n_embedding (int, optional) – Number of features per atom.
  • n_hidden (int, optional) – Number of features for each molecule after DTNNStep
  • n_distance (int, optional) – granularity of distance matrix step size will be (distance_max-distance_min)/n_distance
  • distance_min (float, optional) – minimum distance of atom pairs, default = -1 Angstrom
  • distance_max (float, optional) – maximum distance of atom pairs, default = 18 Angstrom
  • mode (str) – Only “regression” is currently supported.
  • dropout (float) – the dropout probability to use.
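The distance discretization implied by n_distance, distance_min, and distance_max (step size = (distance_max - distance_min) / n_distance) can be sketched as follows. The Gaussian expansion and its width are illustrative of DTNN-style distance featurization, not the exact internal implementation:

```python
import numpy as np

n_distance, distance_min, distance_max = 100, -1.0, 18.0
step = (distance_max - distance_min) / n_distance  # 0.19 Angstrom per bin

# Centers of the distance bins used to expand each atom-pair distance
centers = distance_min + step * np.arange(n_distance)

def gaussian_expand(d, width=step):
    """Expand a scalar distance into a soft one-hot over the distance bins
    (a common DTNN-style featurization; the width here is an assumption)."""
    return np.exp(-((d - centers) ** 2) / (2 * width ** 2))

feat = gaussian_expand(1.5)  # features for a 1.5 Angstrom atom pair
```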
compute_features_on_batch(X_b)[source]

Computes the values for different Feature Layers on given batch

A tf.py_func wrapper is written around this when creating the input_fn for tf.Estimator

compute_saliency(X)[source]

Compute the saliency map for an input sample.

This computes the Jacobian matrix with the derivative of each output element with respect to each input element. More precisely,

  • If this model has a single output, it returns a matrix of shape (output_shape, input_shape) with the derivatives.
  • If this model has multiple outputs, it returns a list of matrices, one for each output.

This method cannot be used on models that take multiple inputs.

Parameters:X (ndarray) – the input data for a single sample
Returns:
Return type:the Jacobian matrix, or a list of matrices
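The Jacobian shape described above can be illustrated with a finite-difference sketch on a toy function (the real method differentiates through the model automatically; this toy f is not DeepChem code):

```python
import numpy as np

def f(x):
    """Toy 'model': 2 outputs from 3 inputs."""
    return np.array([x[0] * x[1], x[1] + x[2] ** 2])

def numerical_jacobian(f, x, eps=1e-6):
    """Derivative of each output element w.r.t. each input element."""
    y0 = f(x)
    J = np.zeros((y0.size, x.size))  # shape (output_shape, input_shape)
    for j in range(x.size):
        xp = x.copy()
        xp[j] += eps
        J[:, j] = (f(xp) - y0) / eps
    return J

x = np.array([1.0, 2.0, 3.0])
J = numerical_jacobian(f, x)
```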
default_generator(dataset, epochs=1, mode='fit', deterministic=True, pad_batches=True)[source]

Create a generator that iterates batches for a dataset.

Subclasses may override this method to customize how model inputs are generated from the data.

Parameters:
  • dataset (Dataset) – the data to iterate
  • epochs (int) – the number of times to iterate over the full dataset
  • mode (str) – allowed values are ‘fit’ (called during training), ‘predict’ (called during prediction), and ‘uncertainty’ (called during uncertainty prediction)
  • deterministic (bool) – whether to iterate over the dataset in order, or randomly shuffle the data for each epoch
  • pad_batches (bool) – whether to pad each batch up to this model’s preferred batch size
Returns:

  • a generator that iterates batches, each represented as a tuple of lists ([inputs], [outputs], [weights])

evaluate(dataset, metrics, transformers=[], per_task_metrics=False)[source]

Evaluates the performance of this model on specified dataset.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object.
  • metrics (list of deepchem.metrics.Metric) – Evaluation metrics
  • transformers (list) – List of deepchem.transformers.Transformer
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

evaluate_generator(generator, metrics, transformers=[], per_task_metrics=False)[source]

Evaluate the performance of this model on the data produced by a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • metrics (list of deepchem.metrics.Metric) – Evaluation metrics
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

fit(dataset, nb_epoch=10, max_checkpoints_to_keep=5, checkpoint_interval=1000, deterministic=False, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on a dataset.

Parameters:
  • dataset (Dataset) – the Dataset to train on
  • nb_epoch (int) – the number of epochs to train for
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • deterministic (bool) – if True, the samples are processed in order. If False, a different random order is used for each epoch.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
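A callback of the form f(model, step) can be sketched as below; the loop is a stand-in for what fit() does internally, and the names are hypothetical:

```python
logged = []

def logging_callback(model, step):
    """Record the step number every 2 steps; a real callback might run
    validation or write metrics instead."""
    if step % 2 == 0:
        logged.append(step)

def run_fake_training(callbacks, n_steps=6):
    """Stand-in training loop that invokes each callback after every step."""
    model = object()  # stand-in for the model instance
    for step in range(1, n_steps + 1):
        # ... one gradient step would happen here ...
        for cb in callbacks:
            cb(model, step)

run_fake_training([logging_callback])
```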
fit_generator(generator, max_checkpoints_to_keep=5, checkpoint_interval=1000, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on data from a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
Returns:

Return type:

the average loss over the most recent checkpoint interval

fit_on_batch(X, y, w, variables=None, loss=None, callbacks=[], checkpoint=True, max_checkpoints_to_keep=5)[source]

Perform a single step of training.

Parameters:
  • X (ndarray) – the inputs for the batch
  • y (ndarray) – the labels for the batch
  • w (ndarray) – the weights for the batch
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
  • checkpoint (bool) – if true, save a checkpoint after performing the training step
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
get_checkpoints(model_dir=None)[source]

Get a list of all available checkpoint files.

Parameters:model_dir (str, default None) – Directory to get list of checkpoints from. Reverts to self.model_dir if None
get_global_step()[source]

Get the number of steps of fitting that have been performed.

static get_model_filename(model_dir)[source]

Given model directory, obtain filename for the model itself.

get_num_tasks()[source]

Get number of tasks.

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any
static get_params_filename(model_dir)[source]

Given model directory, obtain the filename for the model parameters.

get_task_type()[source]

Currently models can only be classifiers or regressors.

load_from_pretrained(source_model, assignment_map=None, value_map=None, checkpoint=None, model_dir=None, include_top=True, inputs=None, **kwargs)[source]

Copies variable values from a pretrained model. source_model can either be a pretrained model or a model with the same architecture. value_map is a variable-value dictionary. If no value_map is provided, the variable values are restored to the source_model from a checkpoint and a default value_map is created. assignment_map is a dictionary mapping variables from the source_model to the current model. If no assignment_map is provided, one is made from scratch on the assumption that the model is composed of several layers, with the final one being a dense layer. include_top controls whether the final dense layer is copied. The default assignment map is useful when the type of task differs (classification vs. regression) and/or the number of tasks differs between the two models.

Parameters:
  • source_model (dc.KerasModel, required) – source_model can either be the pretrained model or a dc.KerasModel with the same architecture as the pretrained model. It is used to restore from a checkpoint if value_map is None, and to create a default assignment map if assignment_map is None.
  • assignment_map (Dict, default None) – Dictionary mapping the source_model variables to current model variables.
  • value_map (Dict, default None) – Dictionary mapping source_model trainable variables to numpy arrays. If value_map is None, the values are restored from a checkpoint and a default value map is created using the restored values.
  • checkpoint (str, default None) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints.
  • model_dir (str, default None) – Restore model from custom model directory if needed.
  • include_top (bool, default True) – if True, copies the weights and bias associated with the final dense layer. Used only when assignment_map is None.
  • inputs (List, input tensors for model) – if not None, the weights are built for both the source model and self. This option is useful only for models built by subclassing tf.keras.Model, not via the tf.keras functional API.
predict(dataset, transformers=[], outputs=None, output_types=None)[source]

Uses self to make predictions on provided Dataset object.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
  • output_types (list of Strings) – The output types to return. Will retrieve all outputs of these types from the model.
Returns:

  • a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_embedding(dataset)[source]

Predicts embeddings created by the underlying model, if any exist. An embedding must be specified to have output_type of ‘embedding’ in the model definition.

Parameters:dataset (dc.data.Dataset) – Dataset to make prediction on
Returns:
  • a NumPy array of the embeddings the model produces, or a list of arrays if it produces multiple embeddings
predict_on_batch(X, transformers=[], outputs=None)[source]

Generates predictions for input samples, processing samples in a batch.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
Returns:

  • a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_on_generator(generator, transformers=[], outputs=None, output_types=None)[source]
Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned. If outputs is specified, output_types must be None.
  • output_types (String or list of Strings) – If specified, all outputs of this type will be retrieved from the model. If output_types is specified, outputs must be None.
  • Returns – a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs
predict_uncertainty(dataset, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • masks (int) – the number of dropout masks to average over
Returns:

  • for each output, a tuple (y_pred, y_std) where y_pred is the predicted value of the output, and each element of y_std estimates the standard deviation of the corresponding element of y_pred

predict_uncertainty_on_batch(X, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • masks (int) – the number of dropout masks to average over
Returns:

  • for each output, a tuple (y_pred, y_std) where y_pred is the predicted value of the output, and each element of y_std estimates the standard deviation of the corresponding element of y_pred

reload()[source]

Reload trained model from disk.

restore(checkpoint=None, model_dir=None, session=None)[source]

Reload the values of all variables from a checkpoint file.

Parameters:
  • checkpoint (str) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints.
  • model_dir (str, default None) – Directory to restore checkpoint from. If None, use self.model_dir.
  • session (tf.Session(), default None) – Session to run restore ops under. If None, self.session is used.
save()[source]

Dispatcher function for saving.

Each subclass is responsible for overriding this method.

save_checkpoint(max_checkpoints_to_keep=5, model_dir=None)[source]

Save a checkpoint to disk.

Usually you do not need to call this method, since fit() saves checkpoints automatically. If you have disabled automatic checkpointing during fitting, this can be called to manually write checkpoints.

Parameters:
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • model_dir (str, default None) – Model directory to save checkpoint to. If None, revert to self.model_dir
set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:**params (dict) – Estimator parameters.
Returns:self – Estimator instance.
Return type:object

DAGModel

class deepchem.models.DAGModel(n_tasks, max_atoms=50, n_atom_feat=75, n_graph_feat=30, n_outputs=30, layer_sizes=[100], layer_sizes_gather=[100], dropout=None, mode='classification', n_classes=2, uncertainty=False, batch_size=100, **kwargs)[source]

Directed Acyclic Graph models for molecular property prediction.

This model is based on the following paper:

Lusci, Alessandro, Gianluca Pollastri, and Pierre Baldi. “Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for drug-like molecules.” Journal of chemical information and modeling 53.7 (2013): 1563-1575.

The basic idea of this paper is that a molecule is usually viewed as an undirected graph, but it can be converted to a series of directed graphs: for each atom, build a DAG with that atom as the root and all edges pointing “inwards” toward it. This transformation is implemented in dc.trans.transformers.DAGTransformer.UG_to_DAG.

This model accepts ConvMols as input, just as GraphConvModel does, but these ConvMol objects must be transformed by dc.trans.DAGTransformer.

As a note, performance of this model can be a little sensitive to initialization. It might be worth training a few different instantiations to get a stable set of parameters.
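The undirected-to-DAG conversion described above can be sketched in plain Python: for each atom, run a breadth-first search from that atom and orient every edge toward the side closer to the root. This is a simplified stand-in for dc.trans.DAGTransformer.UG_to_DAG, not its actual implementation:

```python
from collections import deque

def ug_to_dags(adjacency):
    """For each node of an undirected graph (adjacency list), build a DAG
    whose edges all point 'inwards' toward that node."""
    dags = []
    for root in range(len(adjacency)):
        # BFS distances from the chosen root atom
        dist = {root: 0}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        # Keep each edge (u, v) oriented so that v is one step closer to the root
        edges = [(u, v) for u in range(len(adjacency))
                 for v in adjacency[u] if dist[u] == dist[v] + 1]
        dags.append(edges)
    return dags

# A 3-atom chain: 0 - 1 - 2 produces one DAG per atom
dags = ug_to_dags([[1], [0, 2], [1]])
```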

__init__(n_tasks, max_atoms=50, n_atom_feat=75, n_graph_feat=30, n_outputs=30, layer_sizes=[100], layer_sizes_gather=[100], dropout=None, mode='classification', n_classes=2, uncertainty=False, batch_size=100, **kwargs)[source]
Parameters:
  • n_tasks (int) – Number of tasks.
  • max_atoms (int, optional) – Maximum number of atoms in a molecule, should be defined based on dataset.
  • n_atom_feat (int, optional) – Number of features per atom.
  • n_graph_feat (int, optional) – Number of features for atom in the graph.
  • n_outputs (int, optional) – Number of features for each molecule.
  • layer_sizes (list of int, optional) – List of hidden layer size(s) in the propagation step: length of this list represents the number of hidden layers, and each element is the width of corresponding hidden layer.
  • layer_sizes_gather (list of int, optional) – List of hidden layer size(s) in the gather step.
  • dropout (None or float, optional) – Dropout probability, applied after each propagation step and gather step.
  • mode (str, optional) – Either “classification” or “regression” for type of model.
  • n_classes (int) – the number of classes to predict (only used in classification mode)
  • uncertainty (bool) – if True, include extra outputs and loss terms to enable the uncertainty in outputs to be predicted
compute_saliency(X)[source]

Compute the saliency map for an input sample.

This computes the Jacobian matrix with the derivative of each output element with respect to each input element. More precisely,

  • If this model has a single output, it returns a matrix of shape (output_shape, input_shape) with the derivatives.
  • If this model has multiple outputs, it returns a list of matrices, one for each output.

This method cannot be used on models that take multiple inputs.

Parameters:X (ndarray) – the input data for a single sample
Returns:
Return type:the Jacobian matrix, or a list of matrices
default_generator(dataset, epochs=1, mode='fit', deterministic=True, pad_batches=True)[source]

TensorGraph style implementation

evaluate(dataset, metrics, transformers=[], per_task_metrics=False)[source]

Evaluates the performance of this model on specified dataset.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object.
  • metrics (list of deepchem.metrics.Metric) – Evaluation metrics
  • transformers (list) – List of deepchem.transformers.Transformer
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

evaluate_generator(generator, metrics, transformers=[], per_task_metrics=False)[source]

Evaluate the performance of this model on the data produced by a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • metrics (list of deepchem.metrics.Metric) – Evaluation metrics
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

fit(dataset, nb_epoch=10, max_checkpoints_to_keep=5, checkpoint_interval=1000, deterministic=False, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on a dataset.

Parameters:
  • dataset (Dataset) – the Dataset to train on
  • nb_epoch (int) – the number of epochs to train for
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • deterministic (bool) – if True, the samples are processed in order. If False, a different random order is used for each epoch.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
fit_generator(generator, max_checkpoints_to_keep=5, checkpoint_interval=1000, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on data from a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
Returns:

Return type:

the average loss over the most recent checkpoint interval

fit_on_batch(X, y, w, variables=None, loss=None, callbacks=[], checkpoint=True, max_checkpoints_to_keep=5)[source]

Perform a single step of training.

Parameters:
  • X (ndarray) – the inputs for the batch
  • y (ndarray) – the labels for the batch
  • w (ndarray) – the weights for the batch
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
  • checkpoint (bool) – if true, save a checkpoint after performing the training step
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
get_checkpoints(model_dir=None)[source]

Get a list of all available checkpoint files.

Parameters:model_dir (str, default None) – Directory to get list of checkpoints from. Reverts to self.model_dir if None
get_global_step()[source]

Get the number of steps of fitting that have been performed.

static get_model_filename(model_dir)[source]

Given model directory, obtain filename for the model itself.

get_num_tasks()[source]

Get number of tasks.

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any
static get_params_filename(model_dir)[source]

Given model directory, obtain filename for the saved model parameters.

get_task_type()[source]

Currently models can only be classifiers or regressors.

load_from_pretrained(source_model, assignment_map=None, value_map=None, checkpoint=None, model_dir=None, include_top=True, inputs=None, **kwargs)[source]

Copies variable values from a pretrained model. source_model can either be a pretrained model or a model with the same architecture. value_map is a variable-value dictionary. If no value_map is provided, the variable values are restored to the source_model from a checkpoint and a default value_map is created. assignment_map is a dictionary mapping variables from the source_model to the current model. If no assignment_map is provided, one is created from scratch, assuming the model is composed of several layers with a final dense layer. include_top controls whether the final dense layer is copied. The default assignment map is useful when the task type (classification vs. regression) and/or the number of tasks differs between the source and current model.

Parameters:
  • source_model (dc.KerasModel, required) – source_model can either be the pretrained model or a dc.KerasModel with the same architecture as the pretrained model. It is used to restore from a checkpoint, if value_map is None and to create a default assignment map if assignment_map is None
  • assignment_map (Dict, default None) – Dictionary mapping the source_model variables and current model variables
  • value_map (Dict, default None) – Dictionary containing source_model trainable variables mapped to numpy arrays. If value_map is None, the values are restored and a default variable map is created using the restored values
  • checkpoint (str, default None) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints
  • model_dir (str, default None) – Restore model from custom model directory if needed
  • include_top (bool, default True) – if True, copies the weights and bias associated with the final dense layer. Used only when assignment map is None
  • inputs (List, input tensors for model) – if not None, then the weights are built for both the source and self. This option is useful only for models that are built by subclassing tf.keras.Model, and not using the functional API by tf.keras
predict(dataset, transformers=[], outputs=None, output_types=None)[source]

Uses self to make predictions on provided Dataset object.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
  • output_types (list of Strings) – The output types to return. Will retrieve all outputs of these types from the model.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_embedding(dataset)[source]

Predicts embeddings created by underlying model if any exist. An embedding must be specified to have output_type of ‘embedding’ in the model definition.

Parameters:dataset (dc.data.Dataset) – Dataset to make prediction on
Returns:
a NumPy array of embeddings if the model produces a single embedding, or a list of arrays if it produces multiple embeddings
predict_on_batch(X, transformers=[], outputs=None)[source]

Generates predictions for input samples, processing samples in a batch.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_on_generator(generator, transformers=[], outputs=None, output_types=None)[source]
Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned. If outputs is specified, output_types must be None.
  • output_types (String or list of Strings) – If specified, all outputs of this type will be retrieved from the model. If output_types is specified, outputs must be None.
Returns: a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs
predict_uncertainty(dataset, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • masks (int) – the number of dropout masks to average over
Returns:

for each output, a tuple (y_pred, y_std) where y_pred is the predicted value of the output, and each element of y_std estimates the standard deviation of the corresponding element of y_pred

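The aggregation over dropout masks can be sketched independently of DeepChem: y_pred is the mean of the per-mask predictions, the epistemic variance is their spread, and the aleatoric variance comes from the model's own variance output (the arrays below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
masks, n_samples, n_tasks = 50, 4, 1

# Hypothetical per-mask predictions and model-estimated variances.
per_mask_preds = rng.normal(loc=1.0, scale=0.1, size=(masks, n_samples, n_tasks))
aleatoric_var = np.full((n_samples, n_tasks), 0.04)  # model's own variance estimate

y_pred = per_mask_preds.mean(axis=0)            # average over dropout masks
epistemic_var = per_mask_preds.var(axis=0)      # variation across masks
y_std = np.sqrt(epistemic_var + aleatoric_var)  # combined uncertainty

print(y_pred.shape, y_std.shape)  # (4, 1) (4, 1)
```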
predict_uncertainty_on_batch(X, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • masks (int) – the number of dropout masks to average over
Returns:

for each output, a tuple (y_pred, y_std) where y_pred is the predicted value of the output, and each element of y_std estimates the standard deviation of the corresponding element of y_pred

reload()[source]

Reload trained model from disk.

restore(checkpoint=None, model_dir=None, session=None)[source]

Reload the values of all variables from a checkpoint file.

Parameters:
  • checkpoint (str) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints.
  • model_dir (str, default None) – Directory to restore checkpoint from. If None, use self.model_dir.
  • session (tf.Session(), default None) – Session to run restore ops under. If None, self.session is used.
save()[source]

Dispatcher function for saving.

Each subclass is responsible for overriding this method.

save_checkpoint(max_checkpoints_to_keep=5, model_dir=None)[source]

Save a checkpoint to disk.

Usually you do not need to call this method, since fit() saves checkpoints automatically. If you have disabled automatic checkpointing during fitting, this can be called to manually write checkpoints.

Parameters:
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • model_dir (str, default None) – Model directory to save checkpoint to. If None, revert to self.model_dir
set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:**params (dict) – Estimator parameters.
Returns:self – Estimator instance.
Return type:object

GraphConvModel

class deepchem.models.GraphConvModel(n_tasks, graph_conv_layers=[64, 64], dense_layer_size=128, dropout=0.0, mode='classification', number_atom_features=75, n_classes=2, batch_size=100, batch_normalize=True, uncertainty=False, **kwargs)[source]

Graph Convolutional Models.

This class implements the graph convolutional model from the following paper:

Duvenaud, David K., et al. “Convolutional networks on graphs for learning molecular fingerprints.” Advances in neural information processing systems. 2015.

__init__(n_tasks, graph_conv_layers=[64, 64], dense_layer_size=128, dropout=0.0, mode='classification', number_atom_features=75, n_classes=2, batch_size=100, batch_normalize=True, uncertainty=False, **kwargs)[source]

The wrapper class for graph convolutions.

Note that since the underlying _GraphConvKerasModel class is specified using imperative subclassing style, this model cannot make predictions for arbitrary outputs.

Parameters:
  • n_tasks (int) – Number of tasks
  • graph_conv_layers (list of int) – Width of channels for the Graph Convolution Layers
  • dense_layer_size (int) – Width of channels for Atom Level Dense Layer before GraphPool
  • dropout (list or float) – the dropout probability to use for each layer. The length of this list should equal len(graph_conv_layers)+1 (one value for each convolution layer, and one for the dense layer). Alternatively this may be a single value instead of a list, in which case the same value is used for every layer.
  • mode (str) – Either “classification” or “regression”
  • number_atom_features (int) – 75 is the default number of atom features created, but this can vary if various options are passed to the function atom_features in graph_features
  • n_classes (int) – the number of classes to predict (only used in classification mode)
  • batch_normalize (bool, default True) – if True, apply batch normalization to the model
  • uncertainty (bool) – if True, include extra outputs and loss terms to enable the uncertainty in outputs to be predicted
compute_saliency(X)[source]

Compute the saliency map for an input sample.

This computes the Jacobian matrix with the derivative of each output element with respect to each input element. More precisely,

  • If this model has a single output, it returns a matrix of shape (output_shape, input_shape) with the derivatives.
  • If this model has multiple outputs, it returns a list of matrices, one for each output.

This method cannot be used on models that take multiple inputs.

Parameters:X (ndarray) – the input data for a single sample
Returns:
Return type:the Jacobian matrix, or a list of matrices
default_generator(dataset, epochs=1, mode='fit', deterministic=True, pad_batches=True)[source]

Create a generator that iterates batches for a dataset.

Subclasses may override this method to customize how model inputs are generated from the data.

Parameters:
  • dataset (Dataset) – the data to iterate
  • epochs (int) – the number of times to iterate over the full dataset
  • mode (str) – allowed values are ‘fit’ (called during training), ‘predict’ (called during prediction), and ‘uncertainty’ (called during uncertainty prediction)
  • deterministic (bool) – whether to iterate over the dataset in order, or randomly shuffle the data for each epoch
  • pad_batches (bool) – whether to pad each batch up to this model’s preferred batch size
Returns:

a generator that iterates batches, each represented as a tuple of lists ([inputs], [outputs], [weights])

evaluate(dataset, metrics, transformers=[], per_task_metrics=False)[source]

Evaluates the performance of this model on specified dataset.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object.
  • metrics (list of deepchem.metrics.Metric) – Evaluation metrics
  • transformers (list) – List of deepchem.transformers.Transformer
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

evaluate_generator(generator, metrics, transformers=[], per_task_metrics=False)[source]

Evaluate the performance of this model on the data produced by a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • metrics (list of deepchem.metrics.Metric) – Evaluation metrics
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

fit(dataset, nb_epoch=10, max_checkpoints_to_keep=5, checkpoint_interval=1000, deterministic=False, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on a dataset.

Parameters:
  • dataset (Dataset) – the Dataset to train on
  • nb_epoch (int) – the number of epochs to train for
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • deterministic (bool) – if True, the samples are processed in order. If False, a different random order is used for each epoch.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
fit_generator(generator, max_checkpoints_to_keep=5, checkpoint_interval=1000, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on data from a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
Returns: the average loss over the most recent checkpoint interval

fit_on_batch(X, y, w, variables=None, loss=None, callbacks=[], checkpoint=True, max_checkpoints_to_keep=5)[source]

Perform a single step of training.

Parameters:
  • X (ndarray) – the inputs for the batch
  • y (ndarray) – the labels for the batch
  • w (ndarray) – the weights for the batch
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
  • checkpoint (bool) – if true, save a checkpoint after performing the training step
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
get_checkpoints(model_dir=None)[source]

Get a list of all available checkpoint files.

Parameters:model_dir (str, default None) – Directory to get list of checkpoints from. Reverts to self.model_dir if None
get_global_step()[source]

Get the number of steps of fitting that have been performed.

static get_model_filename(model_dir)[source]

Given model directory, obtain filename for the model itself.

get_num_tasks()[source]

Get number of tasks.

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any
static get_params_filename(model_dir)[source]

Given model directory, obtain filename for the saved model parameters.

get_task_type()[source]

Currently models can only be classifiers or regressors.

load_from_pretrained(source_model, assignment_map=None, value_map=None, checkpoint=None, model_dir=None, include_top=True, inputs=None, **kwargs)[source]

Copies variable values from a pretrained model. source_model can either be a pretrained model or a model with the same architecture. value_map is a variable-value dictionary. If no value_map is provided, the variable values are restored to the source_model from a checkpoint and a default value_map is created. assignment_map is a dictionary mapping variables from the source_model to the current model. If no assignment_map is provided, one is created from scratch, assuming the model is composed of several layers with a final dense layer. include_top controls whether the final dense layer is copied. The default assignment map is useful when the task type (classification vs. regression) and/or the number of tasks differs between the source and current model.

Parameters:
  • source_model (dc.KerasModel, required) – source_model can either be the pretrained model or a dc.KerasModel with the same architecture as the pretrained model. It is used to restore from a checkpoint, if value_map is None and to create a default assignment map if assignment_map is None
  • assignment_map (Dict, default None) – Dictionary mapping the source_model variables and current model variables
  • value_map (Dict, default None) – Dictionary containing source_model trainable variables mapped to numpy arrays. If value_map is None, the values are restored and a default variable map is created using the restored values
  • checkpoint (str, default None) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints
  • model_dir (str, default None) – Restore model from custom model directory if needed
  • include_top (bool, default True) – if True, copies the weights and bias associated with the final dense layer. Used only when assignment map is None
  • inputs (List, input tensors for model) – if not None, then the weights are built for both the source and self. This option is useful only for models that are built by subclassing tf.keras.Model, and not using the functional API by tf.keras
predict(dataset, transformers=[], outputs=None, output_types=None)[source]

Uses self to make predictions on provided Dataset object.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
  • output_types (list of Strings) – The output types to return. Will retrieve all outputs of these types from the model.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_embedding(dataset)[source]

Predicts embeddings created by underlying model if any exist. An embedding must be specified to have output_type of ‘embedding’ in the model definition.

Parameters:dataset (dc.data.Dataset) – Dataset to make prediction on
Returns:
a NumPy array of embeddings if the model produces a single embedding, or a list of arrays if it produces multiple embeddings
predict_on_batch(X, transformers=[], outputs=None)[source]

Generates predictions for input samples, processing samples in a batch.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_on_generator(generator, transformers=[], outputs=None, output_types=None)[source]
Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned. If outputs is specified, output_types must be None.
  • output_types (String or list of Strings) – If specified, all outputs of this type will be retrieved from the model. If output_types is specified, outputs must be None.
Returns: a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs
predict_uncertainty(dataset, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • masks (int) – the number of dropout masks to average over
Returns:

for each output, a tuple (y_pred, y_std) where y_pred is the predicted value of the output, and each element of y_std estimates the standard deviation of the corresponding element of y_pred

predict_uncertainty_on_batch(X, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • masks (int) – the number of dropout masks to average over
Returns:

for each output, a tuple (y_pred, y_std) where y_pred is the predicted value of the output, and each element of y_std estimates the standard deviation of the corresponding element of y_pred

reload()[source]

Reload trained model from disk.

restore(checkpoint=None, model_dir=None, session=None)[source]

Reload the values of all variables from a checkpoint file.

Parameters:
  • checkpoint (str) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints.
  • model_dir (str, default None) – Directory to restore checkpoint from. If None, use self.model_dir.
  • session (tf.Session(), default None) – Session to run restore ops under. If None, self.session is used.
save()[source]

Dispatcher function for saving.

Each subclass is responsible for overriding this method.

save_checkpoint(max_checkpoints_to_keep=5, model_dir=None)[source]

Save a checkpoint to disk.

Usually you do not need to call this method, since fit() saves checkpoints automatically. If you have disabled automatic checkpointing during fitting, this can be called to manually write checkpoints.

Parameters:
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • model_dir (str, default None) – Model directory to save checkpoint to. If None, revert to self.model_dir
set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:**params (dict) – Estimator parameters.
Returns:self – Estimator instance.
Return type:object

MPNNModel

class deepchem.models.MPNNModel(n_tasks, n_atom_feat=70, n_pair_feat=8, n_hidden=100, T=5, M=10, mode='regression', dropout=0.0, n_classes=2, uncertainty=False, batch_size=100, **kwargs)[source]

Message Passing Neural Network.

Message Passing Neural Networks treat graph convolutional operations as an instantiation of a more general message passing scheme. Recall that message passing in a graph occurs when nodes send each other "messages" and update their internal states in response.

Ordering structures in this model are built according to

Vinyals, Oriol, Samy Bengio, and Manjunath Kudlur. “Order matters: Sequence to sequence for sets.” arXiv preprint arXiv:1511.06391 (2015).

__init__(n_tasks, n_atom_feat=70, n_pair_feat=8, n_hidden=100, T=5, M=10, mode='regression', dropout=0.0, n_classes=2, uncertainty=False, batch_size=100, **kwargs)[source]
Parameters:
  • n_tasks (int) – Number of tasks
  • n_atom_feat (int, optional) – Number of features per atom.
  • n_pair_feat (int, optional) – Number of features per pair of atoms.
  • n_hidden (int, optional) – Number of units (convolution depth) in the corresponding hidden layer
  • n_graph_feat (int, optional) – Number of output features for each molecule (graph)
  • dropout (float) – the dropout probability to use.
  • n_classes (int) – the number of classes to predict (only used in classification mode)
  • uncertainty (bool) – if True, include extra outputs and loss terms to enable the uncertainty in outputs to be predicted
compute_saliency(X)[source]

Compute the saliency map for an input sample.

This computes the Jacobian matrix with the derivative of each output element with respect to each input element. More precisely,

  • If this model has a single output, it returns a matrix of shape (output_shape, input_shape) with the derivatives.
  • If this model has multiple outputs, it returns a list of matrices, one for each output.

This method cannot be used on models that take multiple inputs.

Parameters:X (ndarray) – the input data for a single sample
Returns:
Return type:the Jacobian matrix, or a list of matrices
default_generator(dataset, epochs=1, mode='fit', deterministic=True, pad_batches=True)[source]

Create a generator that iterates batches for a dataset.

Subclasses may override this method to customize how model inputs are generated from the data.

Parameters:
  • dataset (Dataset) – the data to iterate
  • epochs (int) – the number of times to iterate over the full dataset
  • mode (str) – allowed values are ‘fit’ (called during training), ‘predict’ (called during prediction), and ‘uncertainty’ (called during uncertainty prediction)
  • deterministic (bool) – whether to iterate over the dataset in order, or randomly shuffle the data for each epoch
  • pad_batches (bool) – whether to pad each batch up to this model’s preferred batch size
Returns:

a generator that iterates batches, each represented as a tuple of lists ([inputs], [outputs], [weights])

evaluate(dataset, metrics, transformers=[], per_task_metrics=False)[source]

Evaluates the performance of this model on specified dataset.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object.
  • metrics (list of deepchem.metrics.Metric) – Evaluation metrics
  • transformers (list) – List of deepchem.transformers.Transformer
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

evaluate_generator(generator, metrics, transformers=[], per_task_metrics=False)[source]

Evaluate the performance of this model on the data produced by a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • metrics (deepchem.metrics.Metric or list of Metric) – evaluation metric(s) to compute
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

fit(dataset, nb_epoch=10, max_checkpoints_to_keep=5, checkpoint_interval=1000, deterministic=False, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on a dataset.

Parameters:
  • dataset (Dataset) – the Dataset to train on
  • nb_epoch (int) – the number of epochs to train for
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • deterministic (bool) – if True, the samples are processed in order. If False, a different random order is used for each epoch.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
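
The callbacks argument expects callables of the form f(model, step). The factory below is an illustrative sketch, not part of DeepChem; it shows where periodic logging or validation would hook in:

```python
def make_logging_callback(log_every=100):
    """Build a callback of the form f(model, step) accepted by fit().

    Hypothetical helper: it only records the steps at which it fired;
    a real callback might compute a validation score or write logs.
    """
    fired = []
    def callback(model, step):
        if step % log_every == 0:
            fired.append(step)
    callback.fired = fired
    return callback

cb = make_logging_callback(log_every=2)
for step in range(1, 7):  # fit() invokes the callback after every step
    cb(model=None, step=step)  # None stands in for the model object
```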
fit_generator(generator, max_checkpoints_to_keep=5, checkpoint_interval=1000, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on data from a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
Returns:

Return type:

the average loss over the most recent checkpoint interval

fit_on_batch(X, y, w, variables=None, loss=None, callbacks=[], checkpoint=True, max_checkpoints_to_keep=5)[source]

Perform a single step of training.

Parameters:
  • X (ndarray) – the inputs for the batch
  • y (ndarray) – the labels for the batch
  • w (ndarray) – the weights for the batch
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
  • checkpoint (bool) – if True, save a checkpoint after performing the training step
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
get_checkpoints(model_dir=None)[source]

Get a list of all available checkpoint files.

Parameters:model_dir (str, default None) – Directory to get list of checkpoints from. Reverts to self.model_dir if None
get_global_step()[source]

Get the number of steps of fitting that have been performed.

static get_model_filename(model_dir)[source]

Given model directory, obtain filename for the model itself.

get_num_tasks()[source]

Get number of tasks.

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any
static get_params_filename(model_dir)[source]

Given model directory, obtain filename where the model parameters are stored.

get_task_type()[source]

Currently models can only be classifiers or regressors.

load_from_pretrained(source_model, assignment_map=None, value_map=None, checkpoint=None, model_dir=None, include_top=True, inputs=None, **kwargs)[source]

Copies variable values from a pretrained model. source_model can either be a pretrained model or a model with the same architecture. value_map is a variable-value dictionary. If no value_map is provided, the variable values are restored to the source_model from a checkpoint and a default value_map is created. assignment_map is a dictionary mapping variables from the source_model to the current model. If no assignment_map is provided, one is made from scratch, assuming the model is composed of several layers with the final one being a dense layer. include_top controls whether the final dense layer is copied. The default assignment map is useful in cases where the task type differs (classification vs. regression) and/or the number of tasks differs between the source and current model.

Parameters:
  • source_model (dc.KerasModel, required) – source_model can either be the pretrained model or a dc.KerasModel with the same architecture as the pretrained model. It is used to restore from a checkpoint, if value_map is None and to create a default assignment map if assignment_map is None
  • assignment_map (Dict, default None) – Dictionary mapping the source_model variables and current model variables
  • value_map (Dict, default None) – Dictionary containing source_model trainable variables mapped to numpy arrays. If value_map is None, the values are restored and a default variable map is created using the restored values
  • checkpoint (str, default None) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints
  • model_dir (str, default None) – Restore model from custom model directory if needed
  • include_top (bool, default True) – if True, copies the weights and bias associated with the final dense layer. Used only when assignment map is None
  • inputs (List, input tensors for model) – if not None, then the weights are built for both the source and self. This option is useful only for models that are built by subclassing tf.keras.Model, and not using the functional API by tf.keras
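
The default mapping described above can be pictured as positional pairing that optionally drops the top layer. This is a simplified sketch of the idea; the variable names and the assumption that the final dense layer owns exactly two variables (kernel and bias) are illustrative, not DeepChem's actual implementation:

```python
def default_assignment_map(source_vars, dest_vars, include_top=True):
    """Sketch of a positional assignment map between two variable lists.

    When include_top is False, the final dense layer's kernel and bias
    (assumed to be the last two variables) are skipped.
    """
    n = len(source_vars) if include_top else len(source_vars) - 2
    return {source_vars[i]: dest_vars[i] for i in range(n)}

# Hypothetical variable names for a pretrained model and a new head:
src = ["conv/kernel", "conv/bias", "dense/kernel", "dense/bias"]
dst = ["conv/kernel", "conv/bias", "out/kernel", "out/bias"]
amap = default_assignment_map(src, dst, include_top=False)
```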
predict(dataset, transformers=[], outputs=None, output_types=None)[source]

Uses self to make predictions on the provided Dataset object.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
  • output_types (list of Strings) – The output types to return. Will retrieve all outputs of these types from the model.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_embedding(dataset)[source]

Predicts embeddings created by the underlying model, if any exist. An embedding must be declared with output_type ‘embedding’ in the model definition.

Parameters:dataset (dc.data.Dataset) – Dataset to make prediction on
Returns:
a NumPy array of the embeddings the model produces, or a list of arrays if it produces multiple embeddings
predict_on_batch(X, transformers=[], outputs=None)[source]

Generates predictions for input samples, processing samples in a batch.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_on_generator(generator, transformers=[], outputs=None, output_types=None)[source]
Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned. If outputs is specified, output_types must be None.
  • output_types (String or list of Strings) – If specified, all outputs of this type will be retrieved from the model. If output_types is specified, outputs must be None.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs
predict_uncertainty(dataset, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • masks (int) – the number of dropout masks to average over
Returns:

for each output, a tuple (y_pred, y_std) where y_pred is the predicted value of the output, and each element of y_std estimates the standard deviation of the corresponding element of y_pred
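
The masks-averaging scheme can be sketched in NumPy. Here predict_once is a toy stochastic model (inverted dropout on the inputs) standing in for a network with dropout enabled at inference; only the epistemic spread is computed, whereas the real method also folds in the model's aleatoric variance estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_with_dropout_uncertainty(predict_once, X, masks=50):
    """Sketch of the masks-averaging idea behind predict_uncertainty.

    predict_once(X) is assumed to apply a different random dropout mask
    on every call; the mean over calls is the prediction, and the spread
    gives the epistemic part of the uncertainty.
    """
    preds = np.stack([predict_once(X) for _ in range(masks)])
    return preds.mean(axis=0), preds.std(axis=0)

# Toy stochastic "model": a linear map with 20% dropout on the input.
W = np.array([[1.0], [2.0], [3.0]])
def predict_once(X):
    mask = rng.random(X.shape) > 0.2
    return (X * mask) @ W / 0.8  # inverted-dropout rescaling

X = np.ones((4, 3))
y_pred, y_std = predict_with_dropout_uncertainty(predict_once, X, masks=200)
```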

predict_uncertainty_on_batch(X, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • masks (int) – the number of dropout masks to average over
Returns:

for each output, a tuple (y_pred, y_std) where y_pred is the predicted value of the output, and each element of y_std estimates the standard deviation of the corresponding element of y_pred

reload()[source]

Reload trained model from disk.

restore(checkpoint=None, model_dir=None, session=None)[source]

Reload the values of all variables from a checkpoint file.

Parameters:
  • checkpoint (str) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints.
  • model_dir (str, default None) – Directory to restore checkpoint from. If None, use self.model_dir.
  • session (tf.Session(), default None) – Session to run restore ops under. If None, self.session is used.
save()[source]

Dispatcher function for saving.

Each subclass is responsible for overriding this method.

save_checkpoint(max_checkpoints_to_keep=5, model_dir=None)[source]

Save a checkpoint to disk.

Usually you do not need to call this method, since fit() saves checkpoints automatically. If you have disabled automatic checkpointing during fitting, this can be called to manually write checkpoints.

Parameters:
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • model_dir (str, default None) – Model directory to save checkpoint to. If None, revert to self.model_dir
set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:**params (dict) – Estimator parameters.
Returns:self – Estimator instance.
Return type:object
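
The <component>__<parameter> convention works roughly like this sketch (simplified from scikit-learn's behavior; the real set_params also validates parameter names):

```python
def set_nested_params(obj, **params):
    """Simplified sketch of sklearn-style nested parameter setting.

    'component__parameter' updates an attribute of a sub-object;
    plain names update the object itself. Illustration only.
    """
    for key, value in params.items():
        if "__" in key:
            component, _, name = key.partition("__")
            setattr(getattr(obj, component), name, value)
        else:
            setattr(obj, key, value)
    return obj

# Hypothetical estimator with a nested component:
class Layer:
    units = 10

class Estimator:
    learning_rate = 0.001
    head = Layer()

est = set_nested_params(Estimator(), learning_rate=0.01, head__units=32)
```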

ScScoreModel

class deepchem.models.ScScoreModel(n_features, layer_sizes=[300, 300, 300], dropouts=0.0, **kwargs)[source]

This model implements the SCScore synthetic complexity metric described in https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00622.

Several definitions of molecular complexity exist to facilitate prioritization of lead compounds, to identify diversity-inducing and complexifying reactions, and to guide retrosynthetic searches. In this work, we focus on synthetic complexity and reformalize its definition to correlate with the expected number of reaction steps required to produce a target molecule, with implicit knowledge about what compounds are reasonable starting materials. We train a neural network model on 12 million reactions from the Reaxys database to impose a pairwise inequality constraint enforcing the premise of this definition: that on average, the products of published chemical reactions should be more synthetically complex than their corresponding reactants. The learned metric (SCScore) exhibits highly desirable nonlinear behavior, particularly in recognizing increases in synthetic complexity throughout a number of linear synthetic routes.

The model here uses a hinge loss instead of the shifted ReLU loss used in https://github.com/connorcoley/scscore.

This could cause differentiation issues with compounds that are “close” to each other in complexity.
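
The pairwise constraint behind the model (products should score as more complex than their reactants) can be expressed as a hinge loss over score pairs. This NumPy sketch is for illustration; the margin value is an assumption, not the value used by ScScoreModel:

```python
import numpy as np

def pairwise_hinge_loss(reactant_scores, product_scores, margin=0.25):
    """Hinge loss on (reactant, product) complexity-score pairs.

    Penalizes pairs where the product is not scored at least `margin`
    higher than its reactant. The margin of 0.25 is an illustrative
    assumption.
    """
    return np.maximum(0.0, margin - (product_scores - reactant_scores)).mean()

r = np.array([1.0, 2.0, 3.0])  # reactant scores
p = np.array([1.5, 2.1, 2.9])  # corresponding product scores
loss = pairwise_hinge_loss(r, p)
# Only the second and third pairs violate the margin and contribute.
```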

__init__(n_features, layer_sizes=[300, 300, 300], dropouts=0.0, **kwargs)[source]
Parameters:
  • n_features (int) – number of features per molecule
  • layer_sizes (list of int) – size of each hidden layer
  • dropouts (float) – dropout probability to apply to each hidden layer
  • kwargs – all other keyword arguments are passed to TensorGraph
compute_saliency(X)[source]

Compute the saliency map for an input sample.

This computes the Jacobian matrix with the derivative of each output element with respect to each input element. More precisely,

  • If this model has a single output, it returns a matrix of shape (output_shape, input_shape) with the derivatives.
  • If this model has multiple outputs, it returns a list of matrices, one for each output.

This method cannot be used on models that take multiple inputs.

Parameters:X (ndarray) – the input data for a single sample
Returns:
Return type:the Jacobian matrix, or a list of matrices
default_generator(dataset, epochs=1, mode='fit', deterministic=True, pad_batches=True)[source]

Create a generator that iterates batches for a dataset.

Subclasses may override this method to customize how model inputs are generated from the data.

Parameters:
  • dataset (Dataset) – the data to iterate
  • epochs (int) – the number of times to iterate over the full dataset
  • mode (str) – allowed values are ‘fit’ (called during training), ‘predict’ (called during prediction), and ‘uncertainty’ (called during uncertainty prediction)
  • deterministic (bool) – whether to iterate over the dataset in order, or randomly shuffle the data for each epoch
  • pad_batches (bool) – whether to pad each batch up to this model’s preferred batch size
Returns:

  • a generator that iterates batches, each represented as a tuple of lists
  • ([inputs], [outputs], [weights])

evaluate(dataset, metrics, transformers=[], per_task_metrics=False)[source]

Evaluates the performance of this model on the specified dataset.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object.
  • metrics (deepchem.metrics.Metric or list of Metric) – evaluation metric(s) to compute
  • transformers (list) – List of deepchem.transformers.Transformer
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

evaluate_generator(generator, metrics, transformers=[], per_task_metrics=False)[source]

Evaluate the performance of this model on the data produced by a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • metrics (deepchem.metrics.Metric or list of Metric) – evaluation metric(s) to compute
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

fit(dataset, nb_epoch=10, max_checkpoints_to_keep=5, checkpoint_interval=1000, deterministic=False, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on a dataset.

Parameters:
  • dataset (Dataset) – the Dataset to train on
  • nb_epoch (int) – the number of epochs to train for
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • deterministic (bool) – if True, the samples are processed in order. If False, a different random order is used for each epoch.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
fit_generator(generator, max_checkpoints_to_keep=5, checkpoint_interval=1000, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on data from a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
Returns:

Return type:

the average loss over the most recent checkpoint interval

fit_on_batch(X, y, w, variables=None, loss=None, callbacks=[], checkpoint=True, max_checkpoints_to_keep=5)[source]

Perform a single step of training.

Parameters:
  • X (ndarray) – the inputs for the batch
  • y (ndarray) – the labels for the batch
  • w (ndarray) – the weights for the batch
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
  • checkpoint (bool) – if True, save a checkpoint after performing the training step
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
get_checkpoints(model_dir=None)[source]

Get a list of all available checkpoint files.

Parameters:model_dir (str, default None) – Directory to get list of checkpoints from. Reverts to self.model_dir if None
get_global_step()[source]

Get the number of steps of fitting that have been performed.

static get_model_filename(model_dir)[source]

Given model directory, obtain filename for the model itself.

get_num_tasks()[source]

Get number of tasks.

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any
static get_params_filename(model_dir)[source]

Given model directory, obtain filename where the model parameters are stored.

get_task_type()[source]

Currently models can only be classifiers or regressors.

load_from_pretrained(source_model, assignment_map=None, value_map=None, checkpoint=None, model_dir=None, include_top=True, inputs=None, **kwargs)[source]

Copies variable values from a pretrained model. source_model can either be a pretrained model or a model with the same architecture. value_map is a variable-value dictionary. If no value_map is provided, the variable values are restored to the source_model from a checkpoint and a default value_map is created. assignment_map is a dictionary mapping variables from the source_model to the current model. If no assignment_map is provided, one is made from scratch, assuming the model is composed of several layers with the final one being a dense layer. include_top controls whether the final dense layer is copied. The default assignment map is useful in cases where the task type differs (classification vs. regression) and/or the number of tasks differs between the source and current model.

Parameters:
  • source_model (dc.KerasModel, required) – source_model can either be the pretrained model or a dc.KerasModel with the same architecture as the pretrained model. It is used to restore from a checkpoint, if value_map is None and to create a default assignment map if assignment_map is None
  • assignment_map (Dict, default None) – Dictionary mapping the source_model variables and current model variables
  • value_map (Dict, default None) – Dictionary containing source_model trainable variables mapped to numpy arrays. If value_map is None, the values are restored and a default variable map is created using the restored values
  • checkpoint (str, default None) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints
  • model_dir (str, default None) – Restore model from custom model directory if needed
  • include_top (bool, default True) – if True, copies the weights and bias associated with the final dense layer. Used only when assignment map is None
  • inputs (List, input tensors for model) – if not None, then the weights are built for both the source and self. This option is useful only for models that are built by subclassing tf.keras.Model, and not using the functional API by tf.keras
predict(dataset, transformers=[], outputs=None, output_types=None)[source]

Uses self to make predictions on the provided Dataset object.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
  • output_types (list of Strings) – The output types to return. Will retrieve all outputs of these types from the model.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_embedding(dataset)[source]

Predicts embeddings created by the underlying model, if any exist. An embedding must be declared with output_type ‘embedding’ in the model definition.

Parameters:dataset (dc.data.Dataset) – Dataset to make prediction on
Returns:
a NumPy array of the embeddings the model produces, or a list of arrays if it produces multiple embeddings
predict_on_batch(X, transformers=[], outputs=None)[source]

Generates predictions for input samples, processing samples in a batch.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_on_generator(generator, transformers=[], outputs=None, output_types=None)[source]
Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned. If outputs is specified, output_types must be None.
  • output_types (String or list of Strings) – If specified, all outputs of this type will be retrieved from the model. If output_types is specified, outputs must be None.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs
predict_uncertainty(dataset, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • masks (int) – the number of dropout masks to average over
Returns:

for each output, a tuple (y_pred, y_std) where y_pred is the predicted value of the output, and each element of y_std estimates the standard deviation of the corresponding element of y_pred

predict_uncertainty_on_batch(X, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • masks (int) – the number of dropout masks to average over
Returns:

for each output, a tuple (y_pred, y_std) where y_pred is the predicted value of the output, and each element of y_std estimates the standard deviation of the corresponding element of y_pred

reload()[source]

Reload trained model from disk.

restore(checkpoint=None, model_dir=None, session=None)[source]

Reload the values of all variables from a checkpoint file.

Parameters:
  • checkpoint (str) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints.
  • model_dir (str, default None) – Directory to restore checkpoint from. If None, use self.model_dir.
  • session (tf.Session(), default None) – Session to run restore ops under. If None, self.session is used.
save()[source]

Dispatcher function for saving.

Each subclass is responsible for overriding this method.

save_checkpoint(max_checkpoints_to_keep=5, model_dir=None)[source]

Save a checkpoint to disk.

Usually you do not need to call this method, since fit() saves checkpoints automatically. If you have disabled automatic checkpointing during fitting, this can be called to manually write checkpoints.

Parameters:
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • model_dir (str, default None) – Model directory to save checkpoint to. If None, revert to self.model_dir
set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:**params (dict) – Estimator parameters.
Returns:self – Estimator instance.
Return type:object

SeqToSeq

class deepchem.models.SeqToSeq(input_tokens, output_tokens, max_output_length, encoder_layers=4, decoder_layers=4, embedding_dimension=512, dropout=0.0, reverse_input=True, variational=False, annealing_start_step=5000, annealing_final_step=10000, **kwargs)[source]

Implements sequence to sequence translation models.

The model is based on the description in Sutskever et al., “Sequence to Sequence Learning with Neural Networks” (https://arxiv.org/abs/1409.3215), although this implementation uses GRUs instead of LSTMs. The goal is to take sequences of tokens as input, and translate each one into a different output sequence. The input and output sequences can both be of variable length, and an output sequence need not have the same length as the input sequence it was generated from. For example, these models were originally developed for use in natural language processing. In that context, the input might be a sequence of English words, and the output might be a sequence of French words. The goal would be to train the model to translate sentences from English to French.

The model consists of two parts called the “encoder” and “decoder”. Each one consists of a stack of recurrent layers. The job of the encoder is to transform the input sequence into a single, fixed length vector called the “embedding”. That vector contains all relevant information from the input sequence. The decoder then transforms the embedding vector into the output sequence.

These models can be used for various purposes. First and most obviously, they can be used for sequence to sequence translation. In any case where you have sequences of tokens, and you want to translate each one into a different sequence, a SeqToSeq model can be trained to perform the translation.

Another possible use case is transforming variable length sequences into fixed length vectors. Many types of models require their inputs to have a fixed shape, which makes it difficult to use them with variable sized inputs (for example, when the input is a molecule, and different molecules have different numbers of atoms). In that case, you can train a SeqToSeq model as an autoencoder, so that it tries to make the output sequence identical to the input one. That forces the embedding vector to contain all information from the original sequence. You can then use the encoder for transforming sequences into fixed length embedding vectors, suitable to use as inputs to other types of models.

Another use case is to train the decoder for use as a generative model. Here again you begin by training the SeqToSeq model as an autoencoder. Once training is complete, you can supply arbitrary embedding vectors, and transform each one into an output sequence. When used in this way, you typically train it as a variational autoencoder. This adds random noise to the encoder, and also adds a constraint term to the loss that forces the embedding vector to have a unit Gaussian distribution. You can then pick random vectors from a Gaussian distribution, and the output sequences should follow the same distribution as the training data.

When training as a variational autoencoder, it is best to use KL cost annealing, as described in https://arxiv.org/abs/1511.06349. The constraint term in the loss is initially set to 0, so the optimizer just tries to minimize the reconstruction loss. Once it has made reasonable progress toward that, the constraint term can be gradually turned back on. The range of steps over which this happens is configurable.
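
The annealing schedule above can be sketched as follows. This is an illustrative helper, not DeepChem's internal code; the linear ramp between the two steps is an assumption, though it matches the documented behavior (weight 0 before `annealing_start_step`, fully on after `annealing_final_step`).

```python
def kl_annealing_weight(step, start=5000, final=10000):
    """Weight on the KL constraint term at a given training step.

    Zero before `start`, one after `final`, and (assumed) linear in between.
    """
    if step <= start:
        return 0.0
    if step >= final:
        return 1.0
    return (step - start) / (final - start)
```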

__init__(input_tokens, output_tokens, max_output_length, encoder_layers=4, decoder_layers=4, embedding_dimension=512, dropout=0.0, reverse_input=True, variational=False, annealing_start_step=5000, annealing_final_step=10000, **kwargs)[source]

Construct a SeqToSeq model.

In addition to the following arguments, this class also accepts all the keyword arguments from TensorGraph.

Parameters:
  • input_tokens (list) – a list of all tokens that may appear in input sequences
  • output_tokens (list) – a list of all tokens that may appear in output sequences
  • max_output_length (int) – the maximum length of output sequence that may be generated
  • encoder_layers (int) – the number of recurrent layers in the encoder
  • decoder_layers (int) – the number of recurrent layers in the decoder
  • embedding_dimension (int) – the width of the embedding vector. This also is the width of all recurrent layers.
  • dropout (float) – the dropout probability to use during training
  • reverse_input (bool) – if True, reverse the order of input sequences before sending them into the encoder. This can improve performance when working with long sequences.
  • variational (bool) – if True, train the model as a variational autoencoder. This adds random noise to the encoder, and also constrains the embedding to follow a unit Gaussian distribution.
  • annealing_start_step (int) – the step (that is, batch) at which to begin turning on the constraint term for KL cost annealing
  • annealing_final_step (int) – the step (that is, batch) at which to finish turning on the constraint term for KL cost annealing
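
`input_tokens` and `output_tokens` must enumerate every token that can occur. For a character-level autoencoder over SMILES strings, the vocabularies can be collected from the training data; a minimal sketch (the SMILES strings are illustrative):

```python
# Collect the character-level vocabulary for a SeqToSeq autoencoder.
train_smiles = ["CCO", "c1ccccc1", "CC(=O)O"]
tokens = sorted(set(c for s in train_smiles for c in s))
max_length = max(len(s) for s in train_smiles)

# For an autoencoder the input and output vocabularies are identical:
# model = deepchem.models.SeqToSeq(tokens, tokens, max_length,
#                                  embedding_dimension=256, variational=True)
```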
compute_saliency(X)[source]

Compute the saliency map for an input sample.

This computes the Jacobian matrix with the derivative of each output element with respect to each input element. More precisely,

  • If this model has a single output, it returns a matrix of shape (output_shape, input_shape) with the derivatives.
  • If this model has multiple outputs, it returns a list of matrices, one for each output.

This method cannot be used on models that take multiple inputs.

Parameters:X (ndarray) – the input data for a single sample
Returns:
Return type:the Jacobian matrix, or a list of matrices
default_generator(dataset, epochs=1, mode='fit', deterministic=True, pad_batches=True)[source]

Create a generator that iterates batches for a dataset.

Subclasses may override this method to customize how model inputs are generated from the data.

Parameters:
  • dataset (Dataset) – the data to iterate
  • epochs (int) – the number of times to iterate over the full dataset
  • mode (str) – allowed values are ‘fit’ (called during training), ‘predict’ (called during prediction), and ‘uncertainty’ (called during uncertainty prediction)
  • deterministic (bool) – whether to iterate over the dataset in order, or randomly shuffle the data for each epoch
  • pad_batches (bool) – whether to pad each batch up to this model’s preferred batch size
Returns:a generator that iterates batches, each represented as a tuple of lists ([inputs], [outputs], [weights])

evaluate(dataset, metrics, transformers=[], per_task_metrics=False)[source]

Evaluates the performance of this model on specified dataset.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object.
  • metrics (deepchem.metrics.Metric) – Evaluation metric
  • transformers (list) – List of deepchem.transformers.Transformer
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

evaluate_generator(generator, metrics, transformers=[], per_task_metrics=False)[source]

Evaluate the performance of this model on the data produced by a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • metrics (deepchem.metrics.Metric) – Evaluation metric
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

fit(dataset, nb_epoch=10, max_checkpoints_to_keep=5, checkpoint_interval=1000, deterministic=False, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on a dataset.

Parameters:
  • dataset (Dataset) – the Dataset to train on
  • nb_epoch (int) – the number of epochs to train for
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • deterministic (bool) – if True, the samples are processed in order. If False, a different random order is used for each epoch.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
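
A callback is any callable of the form f(model, step). A minimal logging sketch (the model argument is ignored here, since which attributes it exposes depends on the model class):

```python
def make_step_logger(every=100):
    """Build a callback that records every `every`-th training step."""
    seen = []

    def callback(model, step):
        # Called by fit() after every training step.
        if step % every == 0:
            seen.append(step)

    return callback, seen
```

The returned callback can then be passed as `model.fit(dataset, callbacks=[callback])`.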
fit_generator(generator, max_checkpoints_to_keep=5, checkpoint_interval=1000, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on data from a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
Returns:

Return type:

the average loss over the most recent checkpoint interval

fit_on_batch(X, y, w, variables=None, loss=None, callbacks=[], checkpoint=True, max_checkpoints_to_keep=5)[source]

Perform a single step of training.

Parameters:
  • X (ndarray) – the inputs for the batch
  • y (ndarray) – the labels for the batch
  • w (ndarray) – the weights for the batch
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
  • checkpoint (bool) – if true, save a checkpoint after performing the training step
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
fit_sequences(sequences, max_checkpoints_to_keep=5, checkpoint_interval=1000, restore=False)[source]

Train this model on a set of sequences.

Parameters:
  • sequences (iterable) – the training samples to fit to. Each sample should be represented as a tuple of the form (input_sequence, output_sequence).
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
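
fit_sequences consumes an iterable of (input_sequence, output_sequence) pairs. For autoencoder training the two halves of each pair are the same sequence; a minimal sketch:

```python
def autoencoder_sequences(strings, epochs=1):
    """Yield (input_sequence, output_sequence) pairs for fit_sequences.

    For an autoencoder the output sequence equals the input sequence.
    """
    for _ in range(epochs):
        for s in strings:
            yield (s, s)

pairs = list(autoencoder_sequences(["CCO", "CCN"], epochs=2))
```

Training would then look like `model.fit_sequences(autoencoder_sequences(train_smiles, epochs=10))`.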
get_checkpoints(model_dir=None)[source]

Get a list of all available checkpoint files.

Parameters:model_dir (str, default None) – Directory to get list of checkpoints from. Reverts to self.model_dir if None
get_global_step()[source]

Get the number of steps of fitting that have been performed.

static get_model_filename(model_dir)[source]

Given model directory, obtain filename for the model itself.

get_num_tasks()[source]

Get number of tasks.

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any
static get_params_filename(model_dir)[source]

Given model directory, obtain filename for the saved model parameters.

get_task_type()[source]

Currently models can only be classifiers or regressors.

load_from_pretrained(source_model, assignment_map=None, value_map=None, checkpoint=None, model_dir=None, include_top=True, inputs=None, **kwargs)[source]

Copies variable values from a pretrained model. source_model can either be a pretrained model or a model with the same architecture. value_map is a variable-value dictionary. If no value_map is provided, the variable values are restored to the source_model from a checkpoint and a default value_map is created. assignment_map is a dictionary mapping variables from the source_model to the current model. If no assignment_map is provided, one is made from scratch on the assumption that the model is composed of several layers, the final one being a dense layer. include_top controls whether the final dense layer is copied. The default assignment map is useful when the type of task differs (classification vs. regression) and/or the number of tasks differs.

Parameters:
  • source_model (dc.KerasModel, required) – source_model can either be the pretrained model or a dc.KerasModel with the same architecture as the pretrained model. It is used to restore from a checkpoint, if value_map is None and to create a default assignment map if assignment_map is None
  • assignment_map (Dict, default None) – Dictionary mapping the source_model variables and current model variables
  • value_map (Dict, default None) – Dictionary containing source_model trainable variables mapped to numpy arrays. If value_map is None, the values are restored and a default variable map is created using the restored values
  • checkpoint (str, default None) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints
  • model_dir (str, default None) – Restore model from custom model directory if needed
  • include_top (bool, default True) – if True, copies the weights and bias associated with the final dense layer. Used only when assignment map is None
  • inputs (List, input tensors for model) – if not None, the weights are built for both the source model and self. This option is useful only for models built by subclassing tf.keras.Model rather than with the tf.keras functional API
predict(dataset, transformers=[], outputs=None, output_types=None)[source]

Uses self to make predictions on provided Dataset object.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
  • output_types (list of Strings) – The output types to return. Will retrieve all outputs of these types from the model.
Returns:a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_embedding(dataset)[source]

Predicts embeddings created by the underlying model, if any exist. An output must be declared with an output_type of ‘embedding’ in the model definition to be treated as an embedding.

Parameters:dataset (dc.data.Dataset) – Dataset to make prediction on
Returns:a NumPy array of the embeddings the model produces, or a list of arrays if it produces multiple embeddings
predict_embeddings(sequences)[source]

Given a set of input sequences, compute the embedding vectors.

Parameters:sequences (iterable) – the input sequences to generate an embedding vector for
predict_from_embeddings(embeddings, beam_width=5)[source]

Given a set of embedding vectors, predict the output sequences.

The prediction is done using a beam search with length normalization.

Parameters:
  • embeddings (iterable) – the embedding vectors to generate predictions for
  • beam_width (int) – the beam width to use for searching. Set to 1 to use a simple greedy search.
predict_from_sequences(sequences, beam_width=5)[source]

Given a set of input sequences, predict the output sequences.

The prediction is done using a beam search with length normalization.

Parameters:
  • sequences (iterable) – the input sequences to generate a prediction for
  • beam_width (int) – the beam width to use for searching. Set to 1 to use a simple greedy search.
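
Beam search keeps the beam_width highest-scoring partial sequences at each step and ranks finished sequences by length-normalized log-probability. The toy sketch below illustrates the idea on a fixed table of per-step token probabilities; it is not the model's internal implementation.

```python
import math

def beam_search(step_probs, beam_width=2):
    """Toy beam search over a fixed table of per-step token probabilities.

    step_probs[t] maps each token to its probability at step t (independent
    of the prefix, for simplicity). Returns the best length-normalized
    token sequence.
    """
    beams = [([], 0.0)]  # (token sequence, total log-probability)
    for probs in step_probs:
        candidates = [
            (seq + [tok], score + math.log(p))
            for seq, score in beams
            for tok, p in probs.items()
        ]
        # Keep the beam_width best partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    # Length normalization: divide total log-probability by sequence length.
    best = max(beams, key=lambda c: c[1] / len(c[0]))
    return best[0]
```

With `beam_width=1` this degenerates toward a greedy search, as the parameter description above notes.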
predict_on_batch(X, transformers=[], outputs=None)[source]

Generates predictions for input samples, processing samples in a batch.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
Returns:a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_on_generator(generator, transformers=[], outputs=None, output_types=None)[source]
Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned. If outputs is specified, output_types must be None.
  • output_types (String or list of Strings) – If specified, all outputs of this type will be retrieved from the model. If output_types is specified, outputs must be None.
  • Returns – a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs
predict_uncertainty(dataset, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • masks (int) – the number of dropout masks to average over
Returns:for each output, a tuple (y_pred, y_std), where y_pred is the predicted value of the output and each element of y_std estimates the standard deviation of the corresponding element of y_pred
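
A (y_pred, y_std) pair can be turned into, for example, an approximate 95% confidence band, assuming roughly Gaussian predictive errors (an assumption on top of what the method itself guarantees):

```python
import numpy as np

def confidence_band(y_pred, y_std, z=1.96):
    """Approximate 95% band from predict_uncertainty output."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_std = np.asarray(y_std, dtype=float)
    return y_pred - z * y_std, y_pred + z * y_std

lo, hi = confidence_band([1.0, 2.0], [0.1, 0.2])
```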

predict_uncertainty_on_batch(X, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • masks (int) – the number of dropout masks to average over
Returns:for each output, a tuple (y_pred, y_std), where y_pred is the predicted value of the output and each element of y_std estimates the standard deviation of the corresponding element of y_pred

reload()[source]

Reload trained model from disk.

restore(checkpoint=None, model_dir=None, session=None)[source]

Reload the values of all variables from a checkpoint file.

Parameters:
  • checkpoint (str) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints.
  • model_dir (str, default None) – Directory to restore checkpoint from. If None, use self.model_dir.
  • session (tf.Session(), default None) – Session to run restore ops under. If None, self.session is used.
save()[source]

Dispatcher function for saving.

Each subclass is responsible for overriding this method.

save_checkpoint(max_checkpoints_to_keep=5, model_dir=None)[source]

Save a checkpoint to disk.

Usually you do not need to call this method, since fit() saves checkpoints automatically. If you have disabled automatic checkpointing during fitting, this can be called to manually write checkpoints.

Parameters:
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • model_dir (str, default None) – Model directory to save checkpoint to. If None, revert to self.model_dir
set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:**params (dict) – Estimator parameters.
Returns:self – Estimator instance.
Return type:object

GAN

class deepchem.models.GAN(n_generators=1, n_discriminators=1, **kwargs)[source]

Implements Generative Adversarial Networks.

A Generative Adversarial Network (GAN) is a type of generative model. It consists of two parts called the “generator” and the “discriminator”. The generator takes random noise as input and transforms it into an output that (hopefully) resembles the training data. The discriminator takes a set of samples as input and tries to distinguish the real training samples from the ones created by the generator. Both of them are trained together. The discriminator tries to get better and better at telling real from false data, while the generator tries to get better and better at fooling the discriminator.

In many cases there also are additional inputs to the generator and discriminator. In that case it is known as a Conditional GAN (CGAN), since it learns a distribution that is conditional on the values of those inputs. They are referred to as “conditional inputs”.

Many variations on this idea have been proposed, and new varieties of GANs are constantly being developed. This class tries to make it very easy to implement straightforward GANs of the most conventional types. At the same time, it tries to be flexible enough that it can be used to implement many (but certainly not all) variations on the concept.

To define a GAN, you must create a subclass that provides implementations of the following methods:

  • get_noise_input_shape()
  • get_data_input_shapes()
  • create_generator()
  • create_discriminator()

If you want your GAN to have any conditional inputs you must also implement:

get_conditional_input_shapes()

The following methods have default implementations that are suitable for most conventional GANs. You can override them if you want to customize their behavior:

  • create_generator_loss()
  • create_discriminator_loss()
  • get_noise_batch()

This class allows a GAN to have multiple generators and discriminators, a model known as MIX+GAN. It is described in Arora et al., “Generalization and Equilibrium in Generative Adversarial Nets (GANs)” (https://arxiv.org/abs/1703.00573). This can lead to better models, and is especially useful for reducing mode collapse, since different generators can learn different parts of the distribution. To use this technique, simply specify the number of generators and discriminators when calling the constructor. You can then tell predict_gan_generator() which generator to use for predicting samples.
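
The required overrides can be sketched structurally as follows. The shapes are illustrative, the create_* bodies (which must return tf.keras.Model instances) are left as placeholders, and a real subclass would inherit from deepchem.models.GAN:

```python
class ExampleGAN:  # in practice: class ExampleGAN(deepchem.models.GAN)
    def get_noise_input_shape(self):
        # Shape of the random noise fed to the generator (illustrative).
        return (10,)

    def get_data_input_shapes(self):
        # Shapes of the training-data inputs seen by the discriminator.
        return [(5,)]

    def create_generator(self):
        # Must return a tf.keras.Model mapping noise (plus any conditional
        # inputs) to outputs matching get_data_input_shapes().
        raise NotImplementedError

    def create_discriminator(self):
        # Must return a tf.keras.Model mapping data (plus any conditional
        # inputs) to a one-dimensional tensor of probabilities.
        raise NotImplementedError
```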

__init__(n_generators=1, n_discriminators=1, **kwargs)[source]

Construct a GAN.

In addition to the parameters listed below, this class accepts all the keyword arguments from KerasModel.

Parameters:
  • n_generators (int) – the number of generators to include
  • n_discriminators (int) – the number of discriminators to include
compute_saliency(X)[source]

Compute the saliency map for an input sample.

This computes the Jacobian matrix with the derivative of each output element with respect to each input element. More precisely,

  • If this model has a single output, it returns a matrix of shape (output_shape, input_shape) with the derivatives.
  • If this model has multiple outputs, it returns a list of matrices, one for each output.

This method cannot be used on models that take multiple inputs.

Parameters:X (ndarray) – the input data for a single sample
Returns:
Return type:the Jacobian matrix, or a list of matrices
create_discriminator()[source]

Create and return a discriminator.

Subclasses must override this to construct the discriminator. The returned value should be a tf.keras.Model whose inputs are all data inputs, followed by any conditional inputs. Its output should be a one dimensional tensor containing the probability of each sample being a training sample.

create_discriminator_loss(discrim_output_train, discrim_output_gen)[source]

Create the loss function for the discriminator.

The default implementation is appropriate for most cases. Subclasses can override this if they need to customize it.

Parameters:
  • discrim_output_train (Tensor) – the output from the discriminator on a batch of training data. This is its estimate of the probability that each sample is training data.
  • discrim_output_gen (Tensor) – the output from the discriminator on a batch of generated data. This is its estimate of the probability that each sample is training data.
Returns:

Return type:

A Tensor equal to the loss function to use for optimizing the discriminator.

create_generator()[source]

Create and return a generator.

Subclasses must override this to construct the generator. The returned value should be a tf.keras.Model whose inputs are a batch of noise, followed by any conditional inputs. The number and shapes of its outputs must match the return value from get_data_input_shapes(), since generated data must have the same form as training data.

create_generator_loss(discrim_output)[source]

Create the loss function for the generator.

The default implementation is appropriate for most cases. Subclasses can override this if they need to customize it.

Parameters:discrim_output (Tensor) – the output from the discriminator on a batch of generated data. This is its estimate of the probability that each sample is training data.
Returns:
Return type:A Tensor equal to the loss function to use for optimizing the generator.
default_generator(dataset, epochs=1, mode='fit', deterministic=True, pad_batches=True)[source]

Create a generator that iterates batches for a dataset.

Subclasses may override this method to customize how model inputs are generated from the data.

Parameters:
  • dataset (Dataset) – the data to iterate
  • epochs (int) – the number of times to iterate over the full dataset
  • mode (str) – allowed values are ‘fit’ (called during training), ‘predict’ (called during prediction), and ‘uncertainty’ (called during uncertainty prediction)
  • deterministic (bool) – whether to iterate over the dataset in order, or randomly shuffle the data for each epoch
  • pad_batches (bool) – whether to pad each batch up to this model’s preferred batch size
Returns:a generator that iterates batches, each represented as a tuple of lists ([inputs], [outputs], [weights])

evaluate(dataset, metrics, transformers=[], per_task_metrics=False)[source]

Evaluates the performance of this model on specified dataset.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object.
  • metrics (deepchem.metrics.Metric) – Evaluation metric
  • transformers (list) – List of deepchem.transformers.Transformer
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

evaluate_generator(generator, metrics, transformers=[], per_task_metrics=False)[source]

Evaluate the performance of this model on the data produced by a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • metrics (deepchem.metrics.Metric) – Evaluation metric
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

fit(dataset, nb_epoch=10, max_checkpoints_to_keep=5, checkpoint_interval=1000, deterministic=False, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on a dataset.

Parameters:
  • dataset (Dataset) – the Dataset to train on
  • nb_epoch (int) – the number of epochs to train for
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • deterministic (bool) – if True, the samples are processed in order. If False, a different random order is used for each epoch.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
fit_gan(batches, generator_steps=1.0, max_checkpoints_to_keep=5, checkpoint_interval=1000, restore=False)[source]

Train this model on data.

Parameters:
  • batches (iterable) – batches of data to train the discriminator on, each represented as a dict that maps Inputs to values. It should specify values for all members of data_inputs and conditional_inputs.
  • generator_steps (float) – the number of training steps to perform for the generator for each batch. This can be used to adjust the ratio of training steps for the generator and discriminator. For example, 2.0 will perform two training steps for every batch, while 0.5 will only perform one training step for every two batches.
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in batches. Set this to 0 to disable automatic checkpointing.
  • restore (bool) – if True, restore the model from the most recent checkpoint before training it.
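The fractional generator_steps ratio can be understood as a step budget that accumulates per discriminator batch. This toy sketch (not DeepChem's internal code) reproduces the 0.5 example above:

```python
def generator_steps_per_batch(generator_steps, n_batches):
    # Illustrative only: returns how many whole generator training steps
    # would run after each discriminator batch for a fractional ratio.
    budget = 0.0
    schedule = []
    for _ in range(n_batches):
        budget += generator_steps
        whole = int(budget)
        schedule.append(whole)
        budget -= whole
    return schedule
```

With generator_steps=0.5 the generator trains on every second batch; with 2.0 it trains twice per batch.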
fit_generator(generator, max_checkpoints_to_keep=5, checkpoint_interval=1000, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on data from a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
Returns:the average loss over the most recent checkpoint interval
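A minimal generator of the expected (inputs, labels, weights) form can be built from NumPy arrays. This is a sketch; real DeepChem code would typically iterate a Dataset's own batching instead:

```python
import numpy as np

def batch_generator(X, y, batch_size=32, epochs=1):
    # Yields ([inputs], [labels], [weights]) tuples in the form
    # fit_generator() expects; weights default to 1 for every sample.
    for _ in range(epochs):
        for i in range(0, len(X), batch_size):
            Xb, yb = X[i:i + batch_size], y[i:i + batch_size]
            yield ([Xb], [yb], [np.ones(len(Xb))])
```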

fit_on_batch(X, y, w, variables=None, loss=None, callbacks=[], checkpoint=True, max_checkpoints_to_keep=5)[source]

Perform a single step of training.

Parameters:
  • X (ndarray) – the inputs for the batch
  • y (ndarray) – the labels for the batch
  • w (ndarray) – the weights for the batch
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
  • checkpoint (bool) – if true, save a checkpoint after performing the training step
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
get_checkpoints(model_dir=None)[source]

Get a list of all available checkpoint files.

Parameters:model_dir (str, default None) – Directory to get list of checkpoints from. Reverts to self.model_dir if None
get_conditional_input_shapes()[source]

Get the shapes of any conditional inputs.

Subclasses may override this to return a list of tuples, each giving the shape of one of the conditional inputs. The actual Input layers will be created automatically. The dimension corresponding to the batch size should be omitted.

The default implementation returns an empty list, meaning there are no conditional inputs.

get_data_input_shapes()[source]

Get the shapes of the inputs for training data.

Subclasses must override this to return a list of tuples, each giving the shape of one of the inputs. The actual Input layers will be created automatically. This list of shapes must also match the shapes of the generator’s outputs. The dimension corresponding to the batch size should be omitted.

get_global_step()[source]

Get the number of steps of fitting that have been performed.

static get_model_filename(model_dir)[source]

Given model directory, obtain filename for the model itself.

get_noise_batch(batch_size)[source]

Get a batch of random noise to pass to the generator.

This should return a NumPy array whose shape matches the one returned by get_noise_input_shape(). The default implementation returns normally distributed values. Subclasses can override this to implement a different distribution.
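A sketch of what the default behavior amounts to, and what an overriding subclass might return instead (the noise shape here is illustrative, not a DeepChem default):

```python
import numpy as np

def default_noise_batch(batch_size, noise_shape=(10,)):
    # Mirrors the documented default: normally distributed noise whose
    # shape matches get_noise_input_shape() plus a leading batch dimension.
    return np.random.normal(size=(batch_size,) + noise_shape)

def uniform_noise_batch(batch_size, noise_shape=(10,)):
    # What a subclass overriding get_noise_batch() might return instead.
    return np.random.uniform(-1.0, 1.0, size=(batch_size,) + noise_shape)
```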

get_noise_input_shape()[source]

Get the shape of the generator’s noise input layer.

Subclasses must override this to return a tuple giving the shape of the noise input. The actual Input layer will be created automatically. The dimension corresponding to the batch size should be omitted.

get_num_tasks()[source]

Get number of tasks.

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any
static get_params_filename(model_dir)[source]

Given model directory, obtain filename for the model parameters.

get_task_type()[source]

Currently models can only be classifiers or regressors.

load_from_pretrained(source_model, assignment_map=None, value_map=None, checkpoint=None, model_dir=None, include_top=True, inputs=None, **kwargs)[source]

Copies variable values from a pretrained model. source_model can either be a pretrained model or a model with the same architecture. value_map is a variable-value dictionary. If no value_map is provided, the variable values are restored to the source_model from a checkpoint and a default value_map is created. assignment_map is a dictionary mapping variables from the source_model to the current model. If no assignment_map is provided, one is created from scratch, assuming the model is composed of several layers with the final one being a dense layer. include_top controls whether or not the final dense layer's weights are copied. The default assignment map is useful when the type of task differs (classification vs regression) and/or the number of tasks differs between the two settings.

Parameters:
  • source_model (dc.KerasModel, required) – source_model can either be the pretrained model or a dc.KerasModel with the same architecture as the pretrained model. It is used to restore from a checkpoint, if value_map is None and to create a default assignment map if assignment_map is None
  • assignment_map (Dict, default None) – Dictionary mapping the source_model variables and current model variables
  • value_map (Dict, default None) – Dictionary containing source_model trainable variables mapped to numpy arrays. If value_map is None, the values are restored and a default variable map is created using the restored values
  • checkpoint (str, default None) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints
  • model_dir (str, default None) – Restore model from custom model directory if needed
  • include_top (bool, default True) – if True, copies the weights and bias associated with the final dense layer. Used only when assignment map is None
  • inputs (List, input tensors for model) – if not None, then the weights are built for both the source and self. This option is useful only for models that are built by subclassing tf.keras.Model, and not using the functional API by tf.keras
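The default assignment-map idea can be pictured as pairing source and destination variables in order, optionally dropping the final dense layer. This is a pure-Python sketch of the concept, not DeepChem's implementation:

```python
def build_default_assignment_map(source_vars, dest_vars, include_top=True):
    # Pairs variables in order. With include_top=False the last two entries
    # (assumed here to be the dense head's kernel and bias) are left out;
    # the "two trailing variables" assumption is illustrative.
    n = len(source_vars) if include_top else len(source_vars) - 2
    return dict(zip(source_vars[:n], dest_vars[:n]))
```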
predict(dataset, transformers=[], outputs=None, output_types=None)[source]

Uses self to make predictions on provided Dataset object.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
  • output_types (list of Strings) – The output types to return. Will retrieve all outputs of these types from the model.
Returns:a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_embedding(dataset)[source]

Predicts embeddings created by the underlying model, if any exist. An embedding must be declared with output_type ‘embedding’ in the model definition.

Parameters:dataset (dc.data.Dataset) – Dataset to make prediction on
Returns:a NumPy array of the embeddings the model produces, or a list of arrays if it produces multiple embeddings
predict_gan_generator(batch_size=1, noise_input=None, conditional_inputs=[], generator_index=0)[source]

Use the GAN to generate a batch of samples.

Parameters:
  • batch_size (int) – the number of samples to generate. If either noise_input or conditional_inputs is specified, this argument is ignored since the batch size is then determined by the size of that argument.
  • noise_input (array) – the value to use for the generator’s noise input. If None (the default), get_noise_batch() is called to generate a random input, so each call will produce a new set of samples.
  • conditional_inputs (list of arrays) – the values to use for all conditional inputs. This must be specified if the GAN has any conditional inputs.
  • generator_index (int) – the index of the generator (between 0 and n_generators-1) to use for generating the samples.
Returns:An array (if the generator has only one output) or a list of arrays (if it has multiple outputs) containing the generated samples.

predict_on_batch(X, transformers=[], outputs=None)[source]

Generates predictions for input samples, processing samples in a batch.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
Returns:a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_on_generator(generator, transformers=[], outputs=None, output_types=None)[source]

Make predictions on batches of data produced by a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned. If outputs is specified, output_types must be None.
  • output_types (String or list of Strings) – If specified, all outputs of this type will be retrieved from the model. If output_types is specified, outputs must be None.
Returns:a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs
predict_uncertainty(dataset, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • masks (int) – the number of dropout masks to average over
Returns:for each output, a tuple (y_pred, y_std) where y_pred is the predicted value of the output, and each element of y_std estimates the standard deviation of the corresponding element of y_pred
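The averaging step described above can be sketched in NumPy: stack the repeated dropout predictions and reduce over the mask axis. This shows only the spread among predictions; the aleatoric term that the cited paper adds from the model's own variance estimates is omitted:

```python
import numpy as np

def mc_dropout_stats(stacked_predictions):
    # stacked_predictions: shape (masks, n_samples, ...) from repeated
    # forward passes, each with a different dropout mask.
    y_pred = stacked_predictions.mean(axis=0)
    y_std = stacked_predictions.std(axis=0)
    return y_pred, y_std
```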

predict_uncertainty_on_batch(X, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • masks (int) – the number of dropout masks to average over
Returns:for each output, a tuple (y_pred, y_std) where y_pred is the predicted value of the output, and each element of y_std estimates the standard deviation of the corresponding element of y_pred

reload()[source]

Reload trained model from disk.

restore(checkpoint=None, model_dir=None, session=None)[source]

Reload the values of all variables from a checkpoint file.

Parameters:
  • checkpoint (str) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints.
  • model_dir (str, default None) – Directory to restore checkpoint from. If None, use self.model_dir.
  • session (tf.Session(), default None) – Session to run restore ops under. If None, self.session is used.
save()[source]

Dispatcher function for saving.

Each subclass is responsible for overriding this method.

save_checkpoint(max_checkpoints_to_keep=5, model_dir=None)[source]

Save a checkpoint to disk.

Usually you do not need to call this method, since fit() saves checkpoints automatically. If you have disabled automatic checkpointing during fitting, this can be called to manually write checkpoints.

Parameters:
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • model_dir (str, default None) – Model directory to save checkpoint to. If None, revert to self.model_dir
set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:**params (dict) – Estimator parameters.
Returns:self – Estimator instance.
Return type:object
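The <component>__<parameter> convention splits a key on its first double underscore; deeper nesting keeps the remainder intact for the sub-component to resolve. A small sketch of that lookup rule:

```python
def split_nested_param(key):
    # 'layers__dropout' -> ('layers', 'dropout');
    # 'pipeline__step__lr' -> ('pipeline', 'step__lr').
    component, sep, rest = key.partition('__')
    return (component, rest) if sep else (key, None)
```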

WGAN

class deepchem.models.WGAN(gradient_penalty=10.0, **kwargs)[source]

Implements Wasserstein Generative Adversarial Networks.

This class implements Wasserstein Generative Adversarial Networks (WGANs) as described in Arjovsky et al., “Wasserstein GAN” (https://arxiv.org/abs/1701.07875). A WGAN is conceptually rather different from a conventional GAN, but in practical terms very similar. It reinterprets the discriminator (often called the “critic” in this context) as learning an approximation to the Earth Mover distance between the training and generated distributions. The generator is then trained to minimize that distance. In practice, this just means using slightly different loss functions for training the generator and discriminator.

WGANs have theoretical advantages over conventional GANs, and they often work better in practice. In addition, the discriminator’s loss function can be directly interpreted as a measure of the quality of the model. That is an advantage over conventional GANs, where the loss does not directly convey information about the quality of the model.

The theory WGANs are based on requires the discriminator’s gradient to be bounded. The original paper achieved this by clipping its weights. This class instead does it by adding a penalty term to the discriminator’s loss, as described in https://arxiv.org/abs/1704.00028. This is sometimes found to produce better results.

There are a few other practical differences between GANs and WGANs. In a conventional GAN, the discriminator’s output must be between 0 and 1 so it can be interpreted as a probability. In a WGAN, it should produce an unbounded output that can be interpreted as a distance.

When training a WGAN, you also should usually use a smaller value for generator_steps. Conventional GANs rely on keeping the generator and discriminator “in balance” with each other. If the discriminator ever gets too good, it becomes impossible for the generator to fool it and training stalls. WGANs do not have this problem, and in fact the better the discriminator is, the easier it is for the generator to improve. It therefore usually works best to perform several training steps on the discriminator for each training step on the generator.
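In NumPy terms, the critic and generator losses reduce to mean differences of the unbounded critic scores. This sketch omits the gradient-penalty term, which requires differentiating the critic at interpolated points and so cannot be shown without a real model:

```python
import numpy as np

def wgan_critic_loss(scores_train, scores_gen):
    # The critic maximizes scores on real data and minimizes them on
    # generated data, so its loss is the negated score gap.
    return np.mean(scores_gen) - np.mean(scores_train)

def wgan_generator_loss(scores_gen):
    # The generator tries to raise the critic's scores on its samples.
    return -np.mean(scores_gen)
```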

__init__(gradient_penalty=10.0, **kwargs)[source]

Construct a WGAN.

In addition to the following, this class accepts all the keyword arguments from GAN and KerasModel.

Parameters:gradient_penalty (float) – the magnitude of the gradient penalty loss
compute_saliency(X)[source]

Compute the saliency map for an input sample.

This computes the Jacobian matrix with the derivative of each output element with respect to each input element. More precisely,

  • If this model has a single output, it returns a matrix of shape (output_shape, input_shape) with the derivatives.
  • If this model has multiple outputs, it returns a list of matrices, one for each output.

This method cannot be used on models that take multiple inputs.

Parameters:X (ndarray) – the input data for a single sample
Returns:the Jacobian matrix, or a list of matrices
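For intuition, the matrix compute_saliency returns is the Jacobian J[i, j] = d output_i / d input_j. A finite-difference sketch for a 1-D input illustrates the quantity (DeepChem computes this with automatic differentiation, not this way):

```python
import numpy as np

def finite_difference_jacobian(f, x, eps=1e-6):
    # Numerically approximates J[i, j] = d f(x)[i] / d x[j] for a
    # function mapping a 1-D input to a 1-D output.
    y0 = np.asarray(f(x))
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        xp = x.copy()
        xp[j] += eps
        J[:, j] = (np.asarray(f(xp)) - y0) / eps
    return J
```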
create_discriminator()[source]

Create and return a discriminator.

Subclasses must override this to construct the discriminator. The returned value should be a tf.keras.Model whose inputs are all data inputs, followed by any conditional inputs. Its output should be a one dimensional tensor containing the probability of each sample being a training sample.

create_discriminator_loss(discrim_output_train, discrim_output_gen)[source]

Create the loss function for the discriminator.

The default implementation is appropriate for most cases. Subclasses can override this if they need to customize it.

Parameters:
  • discrim_output_train (Tensor) – the output from the discriminator on a batch of training data. This is its estimate of the probability that each sample is training data.
  • discrim_output_gen (Tensor) – the output from the discriminator on a batch of generated data. This is its estimate of the probability that each sample is training data.
Returns:A Tensor equal to the loss function to use for optimizing the discriminator.

create_generator()[source]

Create and return a generator.

Subclasses must override this to construct the generator. The returned value should be a tf.keras.Model whose inputs are a batch of noise, followed by any conditional inputs. The number and shapes of its outputs must match the return value from get_data_input_shapes(), since generated data must have the same form as training data.

create_generator_loss(discrim_output)[source]

Create the loss function for the generator.

The default implementation is appropriate for most cases. Subclasses can override this if they need to customize it.

Parameters:discrim_output (Tensor) – the output from the discriminator on a batch of generated data. This is its estimate of the probability that each sample is training data.
Returns:A Tensor equal to the loss function to use for optimizing the generator.
default_generator(dataset, epochs=1, mode='fit', deterministic=True, pad_batches=True)[source]

Create a generator that iterates batches for a dataset.

Subclasses may override this method to customize how model inputs are generated from the data.

Parameters:
  • dataset (Dataset) – the data to iterate
  • epochs (int) – the number of times to iterate over the full dataset
  • mode (str) – allowed values are ‘fit’ (called during training), ‘predict’ (called during prediction), and ‘uncertainty’ (called during uncertainty prediction)
  • deterministic (bool) – whether to iterate over the dataset in order, or randomly shuffle the data for each epoch
  • pad_batches (bool) – whether to pad each batch up to this model’s preferred batch size
Returns:a generator that iterates batches, each represented as a tuple of lists ([inputs], [outputs], [weights])

evaluate(dataset, metrics, transformers=[], per_task_metrics=False)[source]

Evaluates the performance of this model on specified dataset.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object.
  • metrics (deepchem.metrics.Metric or list) – Evaluation metric(s)
  • transformers (list) – List of deepchem.transformers.Transformer
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:Maps tasks to scores under each metric.
Return type:dict

evaluate_generator(generator, metrics, transformers=[], per_task_metrics=False)[source]

Evaluate the performance of this model on the data produced by a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • metrics (deepchem.metrics.Metric or list) – Evaluation metric(s)
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:Maps tasks to scores under each metric.
Return type:dict

fit(dataset, nb_epoch=10, max_checkpoints_to_keep=5, checkpoint_interval=1000, deterministic=False, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on a dataset.

Parameters:
  • dataset (Dataset) – the Dataset to train on
  • nb_epoch (int) – the number of epochs to train for
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • deterministic (bool) – if True, the samples are processed in order. If False, a different random order is used for each epoch.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
fit_gan(batches, generator_steps=1.0, max_checkpoints_to_keep=5, checkpoint_interval=1000, restore=False)[source]

Train this model on data.

Parameters:
  • batches (iterable) – batches of data to train the discriminator on, each represented as a dict that maps Inputs to values. It should specify values for all members of data_inputs and conditional_inputs.
  • generator_steps (float) – the number of training steps to perform for the generator for each batch. This can be used to adjust the ratio of training steps for the generator and discriminator. For example, 2.0 will perform two training steps for every batch, while 0.5 will only perform one training step for every two batches.
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in batches. Set this to 0 to disable automatic checkpointing.
  • restore (bool) – if True, restore the model from the most recent checkpoint before training it.
fit_generator(generator, max_checkpoints_to_keep=5, checkpoint_interval=1000, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on data from a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
Returns:the average loss over the most recent checkpoint interval

fit_on_batch(X, y, w, variables=None, loss=None, callbacks=[], checkpoint=True, max_checkpoints_to_keep=5)[source]

Perform a single step of training.

Parameters:
  • X (ndarray) – the inputs for the batch
  • y (ndarray) – the labels for the batch
  • w (ndarray) – the weights for the batch
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
  • checkpoint (bool) – if true, save a checkpoint after performing the training step
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
get_checkpoints(model_dir=None)[source]

Get a list of all available checkpoint files.

Parameters:model_dir (str, default None) – Directory to get list of checkpoints from. Reverts to self.model_dir if None
get_conditional_input_shapes()[source]

Get the shapes of any conditional inputs.

Subclasses may override this to return a list of tuples, each giving the shape of one of the conditional inputs. The actual Input layers will be created automatically. The dimension corresponding to the batch size should be omitted.

The default implementation returns an empty list, meaning there are no conditional inputs.

get_data_input_shapes()[source]

Get the shapes of the inputs for training data.

Subclasses must override this to return a list of tuples, each giving the shape of one of the inputs. The actual Input layers will be created automatically. This list of shapes must also match the shapes of the generator’s outputs. The dimension corresponding to the batch size should be omitted.

get_global_step()[source]

Get the number of steps of fitting that have been performed.

static get_model_filename(model_dir)[source]

Given model directory, obtain filename for the model itself.

get_noise_batch(batch_size)[source]

Get a batch of random noise to pass to the generator.

This should return a NumPy array whose shape matches the one returned by get_noise_input_shape(). The default implementation returns normally distributed values. Subclasses can override this to implement a different distribution.

get_noise_input_shape()[source]

Get the shape of the generator’s noise input layer.

Subclasses must override this to return a tuple giving the shape of the noise input. The actual Input layer will be created automatically. The dimension corresponding to the batch size should be omitted.

get_num_tasks()[source]

Get number of tasks.

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any
static get_params_filename(model_dir)[source]

Given model directory, obtain filename for the model parameters.

get_task_type()[source]

Currently models can only be classifiers or regressors.

load_from_pretrained(source_model, assignment_map=None, value_map=None, checkpoint=None, model_dir=None, include_top=True, inputs=None, **kwargs)[source]

Copies variable values from a pretrained model. source_model can either be a pretrained model or a model with the same architecture. value_map is a variable-value dictionary. If no value_map is provided, the variable values are restored to the source_model from a checkpoint and a default value_map is created. assignment_map is a dictionary mapping variables from the source_model to the current model. If no assignment_map is provided, one is created from scratch, assuming the model is composed of several layers with the final one being a dense layer. include_top controls whether or not the final dense layer's weights are copied. The default assignment map is useful when the type of task differs (classification vs regression) and/or the number of tasks differs between the two settings.

Parameters:
  • source_model (dc.KerasModel, required) – source_model can either be the pretrained model or a dc.KerasModel with the same architecture as the pretrained model. It is used to restore from a checkpoint, if value_map is None and to create a default assignment map if assignment_map is None
  • assignment_map (Dict, default None) – Dictionary mapping the source_model variables and current model variables
  • value_map (Dict, default None) – Dictionary containing source_model trainable variables mapped to numpy arrays. If value_map is None, the values are restored and a default variable map is created using the restored values
  • checkpoint (str, default None) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints
  • model_dir (str, default None) – Restore model from custom model directory if needed
  • include_top (bool, default True) – if True, copies the weights and bias associated with the final dense layer. Used only when assignment map is None
  • inputs (List, input tensors for model) – if not None, then the weights are built for both the source and self. This option is useful only for models that are built by subclassing tf.keras.Model, and not using the functional API by tf.keras
predict(dataset, transformers=[], outputs=None, output_types=None)[source]

Uses self to make predictions on provided Dataset object.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
  • output_types (list of Strings) – The output types to return. Will retrieve all outputs of these types from the model.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_embedding(dataset)[source]

Predicts embeddings created by the underlying model, if any exist. An embedding must be specified to have output_type of ‘embedding’ in the model definition.

Parameters:dataset (dc.data.Dataset) – Dataset to make prediction on
Returns:
a NumPy array of the embeddings the model produces, or a list of arrays if it produces multiple embeddings
predict_gan_generator(batch_size=1, noise_input=None, conditional_inputs=[], generator_index=0)[source]

Use the GAN to generate a batch of samples.

Parameters:
  • batch_size (int) – the number of samples to generate. If either noise_input or conditional_inputs is specified, this argument is ignored since the batch size is then determined by the size of that argument.
  • noise_input (array) – the value to use for the generator’s noise input. If None (the default), get_noise_batch() is called to generate a random input, so each call will produce a new set of samples.
  • conditional_inputs (list of arrays) – the values to use for all conditional inputs. This must be specified if the GAN has any conditional inputs.
  • generator_index (int) – the index of the generator (between 0 and n_generators-1) to use for generating the samples.
Returns:

An array (if the generator has only one output) or a list of arrays (if it has multiple outputs) containing the generated samples.

predict_on_batch(X, transformers=[], outputs=None)[source]

Generates predictions for input samples, processing samples in a batch.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_on_generator(generator, transformers=[], outputs=None, output_types=None)[source]
Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned. If outputs is specified, output_types must be None.
  • output_types (String or list of Strings) – If specified, all outputs of this type will be retrieved from the model. If output_types is specified, outputs must be None.
  • Returns – a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs
predict_uncertainty(dataset, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • masks (int) – the number of dropout masks to average over
Returns:

for each output, a tuple (y_pred, y_std), where y_pred is the predicted value of the output and each element of y_std estimates the standard deviation of the corresponding element of y_pred
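The averaging described above can be sketched with NumPy. The snippet below simulates repeated stochastic forward passes (the noisy callable is a hypothetical stand-in for a dropout-enabled model) and reduces them to a mean prediction and a per-element standard deviation; it illustrates only the epistemic part, whereas DeepChem additionally folds in the model's own aleatoric variance estimate.

```python
import numpy as np

def predict_with_uncertainty(stochastic_predict, X, masks=50):
    """Average `masks` stochastic forward passes into (y_pred, y_std).

    `stochastic_predict` is any callable returning a (possibly noisy)
    prediction for X, e.g. a model run with dropout enabled.
    """
    preds = np.stack([np.asarray(stochastic_predict(X)) for _ in range(masks)])
    y_pred = preds.mean(axis=0)   # mean over the dropout masks
    y_std = preds.std(axis=0)     # spread among the masked predictions
    return y_pred, y_std

rng = np.random.default_rng(0)
# Toy stand-in for a dropout-enabled forward pass: linear model plus noise
noisy = lambda X: X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=len(X))
y_pred, y_std = predict_with_uncertainty(noisy, np.ones((4, 2)), masks=200)
```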

predict_uncertainty_on_batch(X, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • masks (int) – the number of dropout masks to average over
Returns:

for each output, a tuple (y_pred, y_std), where y_pred is the predicted value of the output and each element of y_std estimates the standard deviation of the corresponding element of y_pred

reload()[source]

Reload trained model from disk.

restore(checkpoint=None, model_dir=None, session=None)[source]

Reload the values of all variables from a checkpoint file.

Parameters:
  • checkpoint (str) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints.
  • model_dir (str, default None) – Directory to restore checkpoint from. If None, use self.model_dir.
  • session (tf.Session(), default None) – Session to run restore ops under. If None, self.session is used.
save()[source]

Dispatcher function for saving.

Each subclass is responsible for overriding this method.

save_checkpoint(max_checkpoints_to_keep=5, model_dir=None)[source]

Save a checkpoint to disk.

Usually you do not need to call this method, since fit() saves checkpoints automatically. If you have disabled automatic checkpointing during fitting, this can be called to manually write checkpoints.

Parameters:
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • model_dir (str, default None) – Model directory to save checkpoint to. If None, revert to self.model_dir
set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:**params (dict) – Estimator parameters.
Returns:self – Estimator instance.
Return type:object

CNN

class deepchem.models.CNN(n_tasks, n_features, dims, layer_filters=[100], kernel_size=5, strides=1, weight_init_stddevs=0.02, bias_init_consts=1.0, weight_decay_penalty=0.0, weight_decay_penalty_type='l2', dropouts=0.5, activation_fns=<function relu>, dense_layer_size=1000, pool_type='max', mode='classification', n_classes=2, uncertainty=False, residual=False, padding='valid', **kwargs)[source]

A 1, 2, or 3 dimensional convolutional network for either regression or classification.

The network consists of the following sequence of layers:

  • A configurable number of convolutional layers
  • A global pooling layer (either max pool or average pool)
  • A final dense layer to compute the output

The model can optionally be composed of pre-activation residual blocks, as described in https://arxiv.org/abs/1603.05027, rather than a simple stack of convolution layers. This often makes training easier, especially when using a large number of layers. Note that residual blocks can only be used when successive layers have the same output shape. Wherever the output shape changes, a simple convolution layer is used even if residual=True.
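The global pooling step in the layer sequence above can be illustrated with NumPy: for a batch of feature maps of shape (batch, *spatial, channels), global max or average pooling collapses every spatial axis, leaving one value per channel. This is a sketch of the pooling semantics, not DeepChem's implementation.

```python
import numpy as np

def global_pool(x, pool_type="max"):
    """Collapse all spatial axes of x (batch, *spatial, channels) to (batch, channels)."""
    spatial_axes = tuple(range(1, x.ndim - 1))
    if pool_type == "max":
        return x.max(axis=spatial_axes)
    elif pool_type == "average":
        return x.mean(axis=spatial_axes)
    raise ValueError("pool_type must be 'max' or 'average'")

x = np.arange(24, dtype=float).reshape(2, 4, 3)  # (batch=2, width=4, channels=3)
print(global_pool(x, "max").shape)  # (2, 3)
```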

__init__(n_tasks, n_features, dims, layer_filters=[100], kernel_size=5, strides=1, weight_init_stddevs=0.02, bias_init_consts=1.0, weight_decay_penalty=0.0, weight_decay_penalty_type='l2', dropouts=0.5, activation_fns=<function relu>, dense_layer_size=1000, pool_type='max', mode='classification', n_classes=2, uncertainty=False, residual=False, padding='valid', **kwargs)[source]

Create a CNN.

In addition to the following arguments, this class also accepts all the keyword arguments from TensorGraph.

Parameters:
  • n_tasks (int) – number of tasks
  • n_features (int) – number of features
  • dims (int) – the number of dimensions to apply convolutions over (1, 2, or 3)
  • layer_filters (list) – the number of output filters for each convolutional layer in the network. The length of this list determines the number of layers.
  • kernel_size (int, tuple, or list) – a list giving the shape of the convolutional kernel for each layer. Each element may be either an int (use the same kernel width for every dimension) or a tuple (the kernel width along each dimension). Alternatively this may be a single int or tuple instead of a list, in which case the same kernel shape is used for every layer.
  • strides (int, tuple, or list) – a list giving the stride between applications of the kernel for each layer. Each element may be either an int (use the same stride for every dimension) or a tuple (the stride along each dimension). Alternatively this may be a single int or tuple instead of a list, in which case the same stride is used for every layer.
  • weight_init_stddevs (list or float) – the standard deviation of the distribution to use for weight initialization of each layer. The length of this list should equal len(layer_filters)+1, where the final element corresponds to the dense layer. Alternatively this may be a single value instead of a list, in which case the same value is used for every layer.
  • bias_init_consts (list or float) – the value to initialize the biases in each layer to. The length of this list should equal len(layer_filters)+1, where the final element corresponds to the dense layer. Alternatively this may be a single value instead of a list, in which case the same value is used for every layer.
  • weight_decay_penalty (float) – the magnitude of the weight decay penalty to use
  • weight_decay_penalty_type (str) – the type of penalty to use for weight decay, either ‘l1’ or ‘l2’
  • dropouts (list or float) – the dropout probability to use for each layer. The length of this list should equal len(layer_filters). Alternatively this may be a single value instead of a list, in which case the same value is used for every layer.
  • activation_fns (list or object) – the Tensorflow activation function to apply to each layer. The length of this list should equal len(layer_filters). Alternatively this may be a single value instead of a list, in which case the same value is used for every layer.
  • pool_type (str) – the type of pooling layer to use, either ‘max’ or ‘average’
  • mode (str) – Either ‘classification’ or ‘regression’
  • n_classes (int) – the number of classes to predict (only used in classification mode)
  • uncertainty (bool) – if True, include extra outputs and loss terms to enable the uncertainty in outputs to be predicted
  • residual (bool) – if True, the model will be composed of pre-activation residual blocks instead of a simple stack of convolutional layers.
  • padding (str) – the type of padding to use for convolutional layers, either ‘valid’ or ‘same’
compute_saliency(X)[source]

Compute the saliency map for an input sample.

This computes the Jacobian matrix with the derivative of each output element with respect to each input element. More precisely,

  • If this model has a single output, it returns a matrix of shape (output_shape, input_shape) with the derivatives.
  • If this model has multiple outputs, it returns a list of matrices, one for each output.

This method cannot be used on models that take multiple inputs.

Parameters:X (ndarray) – the input data for a single sample
Returns:
Return type:the Jacobian matrix, or a list of matrices
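The quantity compute_saliency returns can be mimicked numerically: a Jacobian of shape (output_shape, input_shape). The sketch below approximates it by central finite differences on an arbitrary function (DeepChem itself uses automatic differentiation; this is only an illustration of the layout).

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate d f(x) / d x by central differences.

    Returns an array of shape f(x).shape + x.shape, matching the
    (output_shape, input_shape) layout described above.
    """
    y = np.asarray(f(x))
    jac = np.zeros(y.shape + x.shape)
    for idx in np.ndindex(x.shape):
        dx = np.zeros_like(x)
        dx[idx] = eps
        jac[(Ellipsis,) + idx] = (np.asarray(f(x + dx)) - np.asarray(f(x - dx))) / (2 * eps)
    return jac

W = np.array([[1.0, 2.0], [3.0, 4.0]])
J = numerical_jacobian(lambda v: W @ v, np.zeros(2))
print(np.allclose(J, W))  # the Jacobian of a linear map is its matrix -> True
```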
default_generator(dataset, epochs=1, mode='fit', deterministic=True, pad_batches=True)[source]

Create a generator that iterates batches for a dataset.

Subclasses may override this method to customize how model inputs are generated from the data.

Parameters:
  • dataset (Dataset) – the data to iterate
  • epochs (int) – the number of times to iterate over the full dataset
  • mode (str) – allowed values are ‘fit’ (called during training), ‘predict’ (called during prediction), and ‘uncertainty’ (called during uncertainty prediction)
  • deterministic (bool) – whether to iterate over the dataset in order, or randomly shuffle the data for each epoch
  • pad_batches (bool) – whether to pad each batch up to this model’s preferred batch size
Returns:

a generator that iterates batches, each represented as a tuple of lists: ([inputs], [outputs], [weights])
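The tuple-of-lists contract can be seen in a minimal pure-Python generator over NumPy arrays. This is a sketch of the calling convention only, not DeepChem's default_generator; padding here simply repeats leading samples of the final short batch.

```python
import numpy as np

def simple_batch_generator(X, y, w, batch_size=4, epochs=1, pad_batches=True):
    """Yield ([inputs], [outputs], [weights]) batches, padding the last
    batch up to batch_size if requested."""
    n = len(X)
    for _ in range(epochs):
        for start in range(0, n, batch_size):
            bX = X[start:start + batch_size]
            by = y[start:start + batch_size]
            bw = w[start:start + batch_size]
            if pad_batches and len(bX) < batch_size:
                pad = batch_size - len(bX)  # repeat leading samples to fill
                bX = np.concatenate([bX, bX[:pad]])
                by = np.concatenate([by, by[:pad]])
                bw = np.concatenate([bw, bw[:pad]])
            yield ([bX], [by], [bw])

X, y, w = np.ones((10, 3)), np.zeros(10), np.ones(10)
batches = list(simple_batch_generator(X, y, w, batch_size=4))
```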

evaluate(dataset, metrics, transformers=[], per_task_metrics=False)[source]

Evaluates the performance of this model on specified dataset.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object.
  • metrics (deepchem.metrics.Metric) – Evaluation metric
  • transformers (list) – List of deepchem.transformers.Transformer
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

evaluate_generator(generator, metrics, transformers=[], per_task_metrics=False)[source]

Evaluate the performance of this model on the data produced by a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • metrics (deepchem.metrics.Metric) – Evaluation metric
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

fit(dataset, nb_epoch=10, max_checkpoints_to_keep=5, checkpoint_interval=1000, deterministic=False, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on a dataset.

Parameters:
  • dataset (Dataset) – the Dataset to train on
  • nb_epoch (int) – the number of epochs to train for
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • deterministic (bool) – if True, the samples are processed in order. If False, a different random order is used for each epoch.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
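A callback is any callable of the form f(model, step). The sketch below records a loss reading every few steps; both the `ToyModel` class and its `get_last_loss` accessor are hypothetical stand-ins used only to show the calling convention.

```python
class LossLogger:
    """Callback of the form f(model, step): record the model's most
    recent loss every `interval` training steps."""

    def __init__(self, interval=100):
        self.interval = interval
        self.history = []

    def __call__(self, model, step):
        if step % self.interval == 0:
            # `get_last_loss` is a hypothetical accessor, not a DeepChem API
            self.history.append((step, model.get_last_loss()))

class ToyModel:
    """Stand-in for the model object passed to callbacks."""
    def get_last_loss(self):
        return 0.5

logger = LossLogger(interval=2)
for step in range(1, 7):       # simulate six training steps
    logger(ToyModel(), step)
```

A list of such callables can be passed as the callbacks argument to fit, fit_generator, or fit_on_batch.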
fit_generator(generator, max_checkpoints_to_keep=5, checkpoint_interval=1000, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on data from a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
Returns: the average loss over the most recent checkpoint interval

fit_on_batch(X, y, w, variables=None, loss=None, callbacks=[], checkpoint=True, max_checkpoints_to_keep=5)[source]

Perform a single step of training.

Parameters:
  • X (ndarray) – the inputs for the batch
  • y (ndarray) – the labels for the batch
  • w (ndarray) – the weights for the batch
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
  • checkpoint (bool) – if true, save a checkpoint after performing the training step
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
get_checkpoints(model_dir=None)[source]

Get a list of all available checkpoint files.

Parameters:model_dir (str, default None) – Directory to get list of checkpoints from. Reverts to self.model_dir if None
get_global_step()[source]

Get the number of steps of fitting that have been performed.

static get_model_filename(model_dir)[source]

Given model directory, obtain filename for the model itself.

get_num_tasks()[source]

Get number of tasks.

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any
static get_params_filename(model_dir)[source]

Given model directory, obtain filename for the parameters of the model.

get_task_type()[source]

Currently models can only be classifiers or regressors.

load_from_pretrained(source_model, assignment_map=None, value_map=None, checkpoint=None, model_dir=None, include_top=True, inputs=None, **kwargs)[source]

Copies variable values from a pretrained model. source_model can either be a pretrained model or a model with the same architecture. value_map is a variable-value dictionary; if none is provided, the variable values are restored from a checkpoint into source_model and a default value_map is created. assignment_map is a dictionary mapping variables from the source_model to the current model; if none is provided, a default one is constructed assuming the model is composed of several layers ending in a dense layer, and include_top controls whether that final dense layer is copied. The default assignment map is useful when the task type (classification vs. regression) and/or the number of tasks differs between the two settings.

Parameters:
  • source_model (dc.KerasModel, required) – source_model can either be the pretrained model or a dc.KerasModel with the same architecture as the pretrained model. It is used to restore from a checkpoint, if value_map is None and to create a default assignment map if assignment_map is None
  • assignment_map (Dict, default None) – Dictionary mapping the source_model variables and current model variables
  • value_map (Dict, default None) – Dictionary containing source_model trainable variables mapped to numpy arrays. If value_map is None, the values are restored and a default variable map is created using the restored values
  • checkpoint (str, default None) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints
  • model_dir (str, default None) – Restore model from custom model directory if needed
  • include_top (bool, default True) – if True, copies the weights and bias associated with the final dense layer. Used only when assignment map is None
  • inputs (List, input tensors for model) – if not None, then the weights are built for both the source and self. This option is useful only for models that are built by subclassing tf.keras.Model, and not using the functional API by tf.keras
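A minimal sketch of the default assignment-map idea: pair source and destination variables positionally, optionally dropping the trailing pair (assumed to be the final dense layer's weight and bias) when include_top is False. This illustrates the logic only; it is not DeepChem's implementation, and the variable names are invented for the example.

```python
def default_assignment_map(source_vars, dest_vars, include_top=True):
    """Map source variables onto destination variables positionally.

    When include_top is False, the trailing weight/bias pair (assumed
    to belong to the final dense layer) is left out of the map.
    """
    if not include_top:
        source_vars = source_vars[:-2]
        dest_vars = dest_vars[:-2]
    return dict(zip(source_vars, dest_vars))

src = ["conv/w", "conv/b", "dense/w", "dense/b"]
dst = ["conv/w:new", "conv/b:new", "dense/w:new", "dense/b:new"]
m = default_assignment_map(src, dst, include_top=False)
```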
predict(dataset, transformers=[], outputs=None, output_types=None)[source]

Uses self to make predictions on the provided Dataset object.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
  • output_types (list of Strings) – The output types to return. Will retrieve all outputs of these types from the model.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_embedding(dataset)[source]

Predicts embeddings created by the underlying model, if any exist. An embedding must be specified to have output_type of ‘embedding’ in the model definition.

Parameters:dataset (dc.data.Dataset) – Dataset to make prediction on
Returns:
a NumPy array of the embeddings the model produces, or a list of arrays if it produces multiple embeddings
predict_on_batch(X, transformers=[], outputs=None)[source]

Generates predictions for input samples, processing samples in a batch.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_on_generator(generator, transformers=[], outputs=None, output_types=None)[source]
Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned. If outputs is specified, output_types must be None.
  • output_types (String or list of Strings) – If specified, all outputs of this type will be retrieved from the model. If output_types is specified, outputs must be None.
  • Returns – a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs
predict_uncertainty(dataset, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • masks (int) – the number of dropout masks to average over
Returns:

for each output, a tuple (y_pred, y_std), where y_pred is the predicted value of the output and each element of y_std estimates the standard deviation of the corresponding element of y_pred

predict_uncertainty_on_batch(X, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • masks (int) – the number of dropout masks to average over
Returns:

for each output, a tuple (y_pred, y_std), where y_pred is the predicted value of the output and each element of y_std estimates the standard deviation of the corresponding element of y_pred

reload()[source]

Reload trained model from disk.

restore(checkpoint=None, model_dir=None, session=None)[source]

Reload the values of all variables from a checkpoint file.

Parameters:
  • checkpoint (str) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints.
  • model_dir (str, default None) – Directory to restore checkpoint from. If None, use self.model_dir.
  • session (tf.Session(), default None) – Session to run restore ops under. If None, self.session is used.
save()[source]

Dispatcher function for saving.

Each subclass is responsible for overriding this method.

save_checkpoint(max_checkpoints_to_keep=5, model_dir=None)[source]

Save a checkpoint to disk.

Usually you do not need to call this method, since fit() saves checkpoints automatically. If you have disabled automatic checkpointing during fitting, this can be called to manually write checkpoints.

Parameters:
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • model_dir (str, default None) – Model directory to save checkpoint to. If None, revert to self.model_dir
set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:**params (dict) – Estimator parameters.
Returns:self – Estimator instance.
Return type:object

TextCNNModel

class deepchem.models.TextCNNModel(n_tasks, char_dict, seq_length, n_embedding=75, kernel_sizes=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20], num_filters=[100, 200, 200, 200, 200, 100, 100, 100, 100, 100, 160, 160], dropout=0.25, mode='classification', **kwargs)[source]

A convolutional neural network on SMILES strings. Reimplementation of the discriminator module in ORGAN (https://arxiv.org/abs/1705.10843), which originated from http://emnlp2014.org/papers/pdf/EMNLP2014181.pdf

This model applies multiple 1D convolutional filters to the padded strings, then applies max-over-time pooling to each filter, extracting one feature per filter. All features are concatenated and transformed through several hidden layers to form predictions.

This model was initially developed for sentence-level classification tasks, with words represented as vectors. In this implementation, SMILES strings are split into characters and transformed into one-hot vectors in a similar way. The model can be used for general molecule-level classification or regression tasks. It is also used as the discriminator in the ORGAN model.

Training the model only requires SMILES strings as input; any featurized dataset that includes SMILES in its ids attribute is accepted. PDBbind, QM7, and QM7b are not supported. Before defining the model, build_char_dict should be called to build the character dictionary of the input dataset; an example can be found in examples/delaney/delaney_textcnn.py
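The character-dictionary idea can be sketched in pure Python: scan the dataset's SMILES strings for characters not already in a default dictionary and assign them fresh integer indices. This is a simplification; the real build_char_dict also handles two-character tokens such as 'Cl' and 'Br' and additionally returns the padded sequence length.

```python
def build_char_dict_sketch(smiles_list, default_dict=None):
    """Extend a char->int dictionary with any unseen characters."""
    char_dict = dict(default_dict or {})
    next_index = max(char_dict.values(), default=0) + 1
    for smiles in smiles_list:
        for ch in smiles:
            if ch not in char_dict:
                char_dict[ch] = next_index
                next_index += 1
    return char_dict

# 'c' and '1' are missing from the starter dict and get new indices
d = build_char_dict_sketch(["CCO", "c1ccccc1"], default_dict={"C": 1, "O": 2})
```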

__init__(n_tasks, char_dict, seq_length, n_embedding=75, kernel_sizes=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20], num_filters=[100, 200, 200, 200, 200, 100, 100, 100, 100, 100, 160, 160], dropout=0.25, mode='classification', **kwargs)[source]
Parameters:
  • n_tasks (int) – Number of tasks
  • char_dict (dict) – Mapping from characters in SMILES to integers
  • seq_length (int) – Length of sequences (after padding)
  • n_embedding (int, optional) – Length of embedding vector
  • kernel_sizes (list of int, optional) – Sizes of the 1D convolutional kernels
  • num_filters (list of int, optional) – Number of filters for each kernel size
  • dropout (float, optional) – Dropout rate
  • mode (str) – Either “classification” or “regression” for type of model.
static build_char_dict(dataset, default_dict={'#': 1, '(': 2, ')': 3, '+': 4, '-': 5, '/': 6, '1': 7, '2': 8, '3': 9, '4': 10, '5': 11, '6': 12, '7': 13, '8': 14, '=': 15, 'Br': 30, 'C': 16, 'Cl': 29, 'F': 17, 'H': 18, 'I': 19, 'N': 20, 'O': 21, 'P': 22, 'S': 23, '[': 24, '\\': 25, ']': 26, '_': 27, 'c': 28, 'n': 31, 'o': 32, 's': 33})[source]

Collect all unique characters (in SMILES) from the dataset. This method should be called before defining the model to build an appropriate char_dict.

compute_saliency(X)[source]

Compute the saliency map for an input sample.

This computes the Jacobian matrix with the derivative of each output element with respect to each input element. More precisely,

  • If this model has a single output, it returns a matrix of shape (output_shape, input_shape) with the derivatives.
  • If this model has multiple outputs, it returns a list of matrices, one for each output.

This method cannot be used on models that take multiple inputs.

Parameters:X (ndarray) – the input data for a single sample
Returns:
Return type:the Jacobian matrix, or a list of matrices
default_generator(dataset, epochs=1, mode='fit', deterministic=True, pad_batches=True)[source]

Transform SMILES strings to fixed-length integer vectors.

evaluate(dataset, metrics, transformers=[], per_task_metrics=False)[source]

Evaluates the performance of this model on specified dataset.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object.
  • metrics (deepchem.metrics.Metric) – Evaluation metric
  • transformers (list) – List of deepchem.transformers.Transformer
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

evaluate_generator(generator, metrics, transformers=[], per_task_metrics=False)[source]

Evaluate the performance of this model on the data produced by a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • metrics (deepchem.metrics.Metric) – Evaluation metric
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

fit(dataset, nb_epoch=10, max_checkpoints_to_keep=5, checkpoint_interval=1000, deterministic=False, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on a dataset.

Parameters:
  • dataset (Dataset) – the Dataset to train on
  • nb_epoch (int) – the number of epochs to train for
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • deterministic (bool) – if True, the samples are processed in order. If False, a different random order is used for each epoch.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
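The interplay of checkpoint_interval and max_checkpoints_to_keep described above can be sketched as a plain counting loop. This is an illustrative simulation of the cadence only, not the actual training code:

```python
def checkpoint_steps(n_steps, checkpoint_interval=1000, max_checkpoints_to_keep=5):
    """Simulate which training steps produce checkpoints, keeping only the newest few."""
    kept = []
    for step in range(1, n_steps + 1):
        if checkpoint_interval > 0 and step % checkpoint_interval == 0:
            kept.append(step)
            if len(kept) > max_checkpoints_to_keep:
                kept.pop(0)  # discard the oldest checkpoint
    return kept

# 8000 training steps with defaults: checkpoints at 1000, 2000, ...,
# but only the 5 most recent survive on disk
print(checkpoint_steps(8000))  # -> [4000, 5000, 6000, 7000, 8000]
```

Setting checkpoint_interval to 0 disables checkpointing entirely, matching the parameter description above.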
fit_generator(generator, max_checkpoints_to_keep=5, checkpoint_interval=1000, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on data from a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
Returns:

Return type:

the average loss over the most recent checkpoint interval

fit_on_batch(X, y, w, variables=None, loss=None, callbacks=[], checkpoint=True, max_checkpoints_to_keep=5)[source]

Perform a single step of training.

Parameters:
  • X (ndarray) – the inputs for the batch
  • y (ndarray) – the labels for the batch
  • w (ndarray) – the weights for the batch
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
  • checkpoint (bool) – if True, save a checkpoint after performing the training step
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
get_checkpoints(model_dir=None)[source]

Get a list of all available checkpoint files.

Parameters:model_dir (str, default None) – Directory to get list of checkpoints from. Reverts to self.model_dir if None
get_global_step()[source]

Get the number of steps of fitting that have been performed.

static get_model_filename(model_dir)[source]

Given model directory, obtain filename for the model itself.

get_num_tasks()[source]

Get number of tasks.

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any
static get_params_filename(model_dir)[source]

Given model directory, obtain filename for the model parameters.

get_task_type()[source]

Currently models can only be classifiers or regressors.

load_from_pretrained(source_model, assignment_map=None, value_map=None, checkpoint=None, model_dir=None, include_top=True, inputs=None, **kwargs)[source]

Copies variable values from a pretrained model. source_model can either be a pretrained model or a model with the same architecture. value_map is a variable-value dictionary. If no value_map is provided, the variable values are restored to the source_model from a checkpoint and a default value_map is created. assignment_map is a dictionary mapping variables from the source_model to the current model. If no assignment_map is provided, one is made from scratch and assumes the model is composed of several different layers, with the final one being a dense layer. include_top is used to control whether or not the final dense layer is used. The default assignment map is useful in cases where the type of task is different (classification vs regression) and/or number of tasks in the setting.

Parameters:
  • source_model (dc.KerasModel, required) – source_model can either be the pretrained model or a dc.KerasModel with the same architecture as the pretrained model. It is used to restore from a checkpoint, if value_map is None and to create a default assignment map if assignment_map is None
  • assignment_map (Dict, default None) – Dictionary mapping the source_model variables and current model variables
  • value_map (Dict, default None) – Dictionary containing source_model trainable variables mapped to numpy arrays. If value_map is None, the values are restored and a default variable map is created using the restored values
  • checkpoint (str, default None) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints
  • model_dir (str, default None) – Restore model from custom model directory if needed
  • include_top (bool, default True) – if True, copies the weights and bias associated with the final dense layer. Used only when assignment map is None
  • inputs (List, input tensors for model) – if not None, then the weights are built for both the source and self. This option is useful only for models that are built by subclassing tf.keras.Model, and not using the functional API by tf.keras
predict(dataset, transformers=[], outputs=None, output_types=None)[source]

Uses self to make predictions on provided Dataset object.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
  • output_types (list of Strings) – The output types to return. Will retrieve all outputs of these types from the model.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_embedding(dataset)[source]

Predicts embeddings created by the underlying model, if any exist. An embedding must be specified to have output_type of ‘embedding’ in the model definition.

Parameters:dataset (dc.data.Dataset) – Dataset to make prediction on
Returns:
a NumPy array of the embeddings the model produces, or a list of arrays if it produces multiple embeddings
predict_on_batch(X, transformers=[], outputs=None)[source]

Generates predictions for input samples, processing samples in a batch.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_on_generator(generator, transformers=[], outputs=None, output_types=None)[source]
Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned. If outputs is specified, output_types must be None.
  • output_types (String or list of Strings) – If specified, all outputs of this type will be retrieved from the model. If output_types is specified, outputs must be None.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs
predict_uncertainty(dataset, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • masks (int) – the number of dropout masks to average over
Returns:

for each output, a tuple (y_pred, y_std), where y_pred is the predicted value of the output and each element of y_std estimates the standard deviation of the corresponding element of y_pred
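The dropout-mask averaging described above can be sketched without any deep-learning machinery. A toy linear predictor stands in for the model; all names here are illustrative:

```python
import random
import statistics

def predict_with_dropout(x, weights, p_drop=0.5):
    """One stochastic forward pass: each weight is dropped with probability p_drop."""
    kept = [w * (0.0 if random.random() < p_drop else 1.0 / (1.0 - p_drop))
            for w in weights]
    return sum(w * xi for w, xi in zip(kept, x))

def predict_uncertainty(x, weights, masks=50):
    """Average `masks` stochastic passes; the spread estimates the uncertainty."""
    samples = [predict_with_dropout(x, weights) for _ in range(masks)]
    y_pred = statistics.fmean(samples)   # averaged prediction
    y_std = statistics.stdev(samples)    # spread across dropout masks
    return y_pred, y_std

random.seed(0)
y_pred, y_std = predict_uncertainty([1.0, 2.0, 3.0], [0.5, -0.2, 0.1], masks=200)
```

Increasing masks tightens the estimate of y_pred at the cost of more forward passes. Note this toy captures only the mask-to-mask variation; the method described above also folds in the model's own aleatoric estimates.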

predict_uncertainty_on_batch(X, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • masks (int) – the number of dropout masks to average over
Returns:

for each output, a tuple (y_pred, y_std), where y_pred is the predicted value of the output and each element of y_std estimates the standard deviation of the corresponding element of y_pred

reload()[source]

Reload trained model from disk.

restore(checkpoint=None, model_dir=None, session=None)[source]

Reload the values of all variables from a checkpoint file.

Parameters:
  • checkpoint (str) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints.
  • model_dir (str, default None) – Directory to restore checkpoint from. If None, use self.model_dir.
  • session (tf.Session(), default None) – Session to run restore ops under. If None, self.session is used.
save()[source]

Dispatcher function for saving.

Each subclass is responsible for overriding this method.

save_checkpoint(max_checkpoints_to_keep=5, model_dir=None)[source]

Save a checkpoint to disk.

Usually you do not need to call this method, since fit() saves checkpoints automatically. If you have disabled automatic checkpointing during fitting, this can be called to manually write checkpoints.

Parameters:
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • model_dir (str, default None) – Model directory to save checkpoint to. If None, revert to self.model_dir
set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:**params (dict) – Estimator parameters.
Returns:self – Estimator instance.
Return type:object
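The `<component>__<parameter>` convention works roughly like this. A simplified sketch of the routing logic, not the scikit-learn implementation:

```python
def set_params(obj, **params):
    """Route 'component__parameter' keys to nested sub-objects, plain keys to obj."""
    for key, value in params.items():
        if "__" in key:
            component, _, sub_key = key.partition("__")
            setattr(getattr(obj, component), sub_key, value)
        else:
            setattr(obj, key, value)
    return obj

class Optimizer:  # a toy nested component
    lr = 0.01

class Model:
    n_tasks = 1
    optimizer = Optimizer()

m = set_params(Model(), n_tasks=12, optimizer__lr=0.001)
# m.n_tasks is now 12, and m.optimizer.lr is 0.001
```

The double underscore is what lets a single flat keyword dict address parameters of arbitrarily nested estimators, e.g. in a pipeline.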
smiles_to_seq(smiles)[source]

Tokenizes the characters in a SMILES string to integers.
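Given a char_dict, tokenization plus padding to a fixed max_seq_len might look like the following. This is an illustrative sketch; the pad index and the policy of skipping unknown characters are assumptions:

```python
def smiles_to_seq(smiles, char_dict, max_seq_len=10, pad_idx=0):
    """Map each character to its integer index, then pad/truncate to max_seq_len."""
    seq = [char_dict[c] for c in smiles if c in char_dict]
    seq = seq[:max_seq_len]
    return seq + [pad_idx] * (max_seq_len - len(seq))

char_dict = {"C": 1, "O": 2, "(": 3, ")": 4, "=": 5}
print(smiles_to_seq("CC(=O)O", char_dict))  # -> [1, 1, 3, 5, 2, 4, 2, 0, 0, 0]
```

Fixed-length integer vectors like this are what the embedding layer of sequence models such as Smiles2Vec consume.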

smiles_to_seq_batch(ids_b)[source]

Converts a batch of SMILES strings to np.array sequences.

A tf.py_func wrapper is written around this when creating the input_fn for make_estimator

AtomicConvModel

class deepchem.models.AtomicConvModel(frag1_num_atoms=70, frag2_num_atoms=634, complex_num_atoms=701, max_num_neighbors=12, batch_size=24, atom_types=[6, 7.0, 8.0, 9.0, 11.0, 12.0, 15.0, 16.0, 17.0, 20.0, 25.0, 30.0, 35.0, 53.0, -1.0], radial=[[1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0], [0.0, 4.0, 8.0], [0.4]], layer_sizes=[32, 32, 16], learning_rate=0.001, **kwargs)[source]

Implements an Atomic Convolution Model.

Implements the atomic convolutional networks as introduced in

Gomes, Joseph, et al. “Atomic convolutional networks for predicting protein-ligand binding affinity.” arXiv preprint arXiv:1703.10603 (2017).

The atomic convolutional networks function as a variant of graph convolutions. The difference is that the “graph” here is the nearest-neighbors graph in 3D space. The AtomicConvModel leverages these connections in 3D space to train models that learn to predict the energetic state starting from the spatial geometry of the molecule.
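The “nearest neighbors graph in 3D space” can be sketched directly from atomic coordinates. A pure-Python illustration of the neighbor-list construction only; the real model additionally featurizes distances and atom types:

```python
def neighbor_list(coords, max_num_neighbors=2, cutoff=12.0):
    """For each atom, list the indices of its nearest neighbors within a cutoff."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    neighbors = []
    for i, ci in enumerate(coords):
        # All other atoms inside the cutoff, sorted by squared distance
        others = sorted((dist2(ci, cj), j) for j, cj in enumerate(coords)
                        if j != i and dist2(ci, cj) <= cutoff ** 2)
        neighbors.append([j for _, j in others[:max_num_neighbors]])
    return neighbors

coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
print(neighbor_list(coords))  # -> [[1, 2], [0, 2], [1, 0]]
```

Capping each atom at max_num_neighbors spatial neighbors is what keeps the convolution's receptive field fixed-size, mirroring the max_num_neighbors parameter below.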

__init__(frag1_num_atoms=70, frag2_num_atoms=634, complex_num_atoms=701, max_num_neighbors=12, batch_size=24, atom_types=[6, 7.0, 8.0, 9.0, 11.0, 12.0, 15.0, 16.0, 17.0, 20.0, 25.0, 30.0, 35.0, 53.0, -1.0], radial=[[1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0], [0.0, 4.0, 8.0], [0.4]], layer_sizes=[32, 32, 16], learning_rate=0.001, **kwargs)[source]
Parameters:
  • frag1_num_atoms (int) – Number of atoms in first fragment
  • frag2_num_atoms (int) – Number of atoms in second fragment
  • max_num_neighbors (int) – Maximum number of neighbors possible for an atom. Recall neighbors are spatial neighbors.
  • atom_types (list) – List of atoms recognized by model. Atoms are indicated by their nuclear numbers.
  • radial (list) – TODO: add description
  • layer_sizes (list) – TODO: add description
  • learning_rate (float) – Learning rate for the model.
compute_saliency(X)[source]

Compute the saliency map for an input sample.

This computes the Jacobian matrix with the derivative of each output element with respect to each input element. More precisely,

  • If this model has a single output, it returns a matrix of shape (output_shape, input_shape) with the derivatives.
  • If this model has multiple outputs, it returns a list of matrices, one for each output.

This method cannot be used on models that take multiple inputs.

Parameters:X (ndarray) – the input data for a single sample
Returns:
Return type:the Jacobian matrix, or a list of matrices
default_generator(dataset, epochs=1, mode='fit', deterministic=True, pad_batches=True)[source]

Create a generator that iterates batches for a dataset.

Subclasses may override this method to customize how model inputs are generated from the data.

Parameters:
  • dataset (Dataset) – the data to iterate
  • epochs (int) – the number of times to iterate over the full dataset
  • mode (str) – allowed values are ‘fit’ (called during training), ‘predict’ (called during prediction), and ‘uncertainty’ (called during uncertainty prediction)
  • deterministic (bool) – whether to iterate over the dataset in order, or randomly shuffle the data for each epoch
  • pad_batches (bool) – whether to pad each batch up to this model’s preferred batch size
Returns:

a generator that iterates batches, each represented as a tuple of lists ([inputs], [outputs], [weights])

evaluate(dataset, metrics, transformers=[], per_task_metrics=False)[source]

Evaluates the performance of this model on specified dataset.

Parameters:
  • dataset (dc.data.Dataset) – Dataset object.
  • metrics (list of deepchem.metrics.Metric) – Evaluation metrics.
  • transformers (list) – List of deepchem.transformers.Transformer
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

evaluate_generator(generator, metrics, transformers=[], per_task_metrics=False)[source]

Evaluate the performance of this model on the data produced by a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • metrics (list of deepchem.metrics.Metric) – Evaluation metrics.
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • per_task_metrics (bool) – If True, return per-task scores.
Returns:

Maps tasks to scores under metric.

Return type:

dict

fit(dataset, nb_epoch=10, max_checkpoints_to_keep=5, checkpoint_interval=1000, deterministic=False, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on a dataset.

Parameters:
  • dataset (Dataset) – the Dataset to train on
  • nb_epoch (int) – the number of epochs to train for
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • deterministic (bool) – if True, the samples are processed in order. If False, a different random order is used for each epoch.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
fit_generator(generator, max_checkpoints_to_keep=5, checkpoint_interval=1000, restore=False, variables=None, loss=None, callbacks=[])[source]

Train this model on data from a generator.

Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • checkpoint_interval (int) – the frequency at which to write checkpoints, measured in training steps. Set this to 0 to disable automatic checkpointing.
  • restore (bool) – if True, restore the model from the most recent checkpoint and continue training from there. If False, retrain the model from scratch.
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
Returns:

Return type:

the average loss over the most recent checkpoint interval

fit_on_batch(X, y, w, variables=None, loss=None, callbacks=[], checkpoint=True, max_checkpoints_to_keep=5)[source]

Perform a single step of training.

Parameters:
  • X (ndarray) – the inputs for the batch
  • y (ndarray) – the labels for the batch
  • w (ndarray) – the weights for the batch
  • variables (list of tf.Variable) – the variables to train. If None (the default), all trainable variables in the model are used.
  • loss (function) – a function of the form f(outputs, labels, weights) that computes the loss for each batch. If None (the default), the model’s standard loss function is used.
  • callbacks (function or list of functions) – one or more functions of the form f(model, step) that will be invoked after every step. This can be used to perform validation, logging, etc.
  • checkpoint (bool) – if True, save a checkpoint after performing the training step
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
get_checkpoints(model_dir=None)[source]

Get a list of all available checkpoint files.

Parameters:model_dir (str, default None) – Directory to get list of checkpoints from. Reverts to self.model_dir if None
get_global_step()[source]

Get the number of steps of fitting that have been performed.

static get_model_filename(model_dir)[source]

Given model directory, obtain filename for the model itself.

get_num_tasks()[source]

Get number of tasks.

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any
static get_params_filename(model_dir)[source]

Given model directory, obtain filename for the model parameters.

get_task_type()[source]

Currently models can only be classifiers or regressors.

load_from_pretrained(source_model, assignment_map=None, value_map=None, checkpoint=None, model_dir=None, include_top=True, inputs=None, **kwargs)[source]

Copies variable values from a pretrained model. source_model can either be a pretrained model or a model with the same architecture. value_map is a variable-value dictionary. If no value_map is provided, the variable values are restored to the source_model from a checkpoint and a default value_map is created. assignment_map is a dictionary mapping variables from the source_model to the current model. If no assignment_map is provided, one is made from scratch and assumes the model is composed of several different layers, with the final one being a dense layer. include_top is used to control whether or not the final dense layer is used. The default assignment map is useful in cases where the type of task is different (classification vs regression) and/or number of tasks in the setting.

Parameters:
  • source_model (dc.KerasModel, required) – source_model can either be the pretrained model or a dc.KerasModel with the same architecture as the pretrained model. It is used to restore from a checkpoint, if value_map is None and to create a default assignment map if assignment_map is None
  • assignment_map (Dict, default None) – Dictionary mapping the source_model variables and current model variables
  • value_map (Dict, default None) – Dictionary containing source_model trainable variables mapped to numpy arrays. If value_map is None, the values are restored and a default variable map is created using the restored values
  • checkpoint (str, default None) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints
  • model_dir (str, default None) – Restore model from custom model directory if needed
  • include_top (bool, default True) – if True, copies the weights and bias associated with the final dense layer. Used only when assignment map is None
  • inputs (List, input tensors for model) – if not None, then the weights are built for both the source and self. This option is useful only for models that are built by subclassing tf.keras.Model, and not using the functional API by tf.keras
predict(dataset, transformers=[], outputs=None, output_types=None)[source]

Uses self to make predictions on provided Dataset object.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
  • output_types (list of Strings) – The output types to return. Will retrieve all outputs of these types from the model.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_embedding(dataset)[source]

Predicts embeddings created by the underlying model, if any exist. An embedding must be specified to have output_type of ‘embedding’ in the model definition.

Parameters:dataset (dc.data.Dataset) – Dataset to make prediction on
Returns:
a NumPy array of the embeddings the model produces, or a list of arrays if it produces multiple embeddings
predict_on_batch(X, transformers=[], outputs=None)[source]

Generates predictions for input samples, processing samples in a batch.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs

predict_on_generator(generator, transformers=[], outputs=None, output_types=None)[source]
Parameters:
  • generator (generator) – this should generate batches, each represented as a tuple of the form (inputs, labels, weights).
  • transformers (list of dc.trans.Transformers) – Transformers that the input data has been transformed by. The output is passed through these transformers to undo the transformations.
  • outputs (Tensor or list of Tensors) – The outputs to return. If this is None, the model’s standard prediction outputs will be returned. Alternatively one or more Tensors within the model may be specified, in which case the output of those Tensors will be returned. If outputs is specified, output_types must be None.
  • output_types (String or list of Strings) – If specified, all outputs of this type will be retrieved from the model. If output_types is specified, outputs must be None.
Returns:

a NumPy array if the model produces a single output, or a list of arrays if it produces multiple outputs
predict_uncertainty(dataset, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to make prediction on
  • masks (int) – the number of dropout masks to average over
Returns:

for each output, a tuple (y_pred, y_std), where y_pred is the predicted value of the output and each element of y_std estimates the standard deviation of the corresponding element of y_pred

predict_uncertainty_on_batch(X, masks=50)[source]

Predict the model’s outputs, along with the uncertainty in each one.

The uncertainty is computed as described in https://arxiv.org/abs/1703.04977. It involves repeating the prediction many times with different dropout masks. The prediction is computed as the average over all the predictions. The uncertainty includes both the variation among the predicted values (epistemic uncertainty) and the model’s own estimates for how well it fits the data (aleatoric uncertainty). Not all models support uncertainty prediction.

Parameters:
  • X (ndarray) – the input data, as a Numpy array.
  • masks (int) – the number of dropout masks to average over
Returns:

for each output, a tuple (y_pred, y_std), where y_pred is the predicted value of the output and each element of y_std estimates the standard deviation of the corresponding element of y_pred

reload()[source]

Reload trained model from disk.

restore(checkpoint=None, model_dir=None, session=None)[source]

Reload the values of all variables from a checkpoint file.

Parameters:
  • checkpoint (str) – the path to the checkpoint file to load. If this is None, the most recent checkpoint will be chosen automatically. Call get_checkpoints() to get a list of all available checkpoints.
  • model_dir (str, default None) – Directory to restore checkpoint from. If None, use self.model_dir.
  • session (tf.Session(), default None) – Session to run restore ops under. If None, self.session is used.
save()[source]

Dispatcher function for saving.

Each subclass is responsible for overriding this method.

save_checkpoint(max_checkpoints_to_keep=5, model_dir=None)[source]

Save a checkpoint to disk.

Usually you do not need to call this method, since fit() saves checkpoints automatically. If you have disabled automatic checkpointing during fitting, this can be called to manually write checkpoints.

Parameters:
  • max_checkpoints_to_keep (int) – the maximum number of checkpoints to keep. Older checkpoints are discarded.
  • model_dir (str, default None) – Model directory to save checkpoint to. If None, revert to self.model_dir
set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:**params (dict) – Estimator parameters.
Returns:self – Estimator instance.
Return type:object

Smiles2Vec

class deepchem.models.Smiles2Vec(char_to_idx, n_tasks=10, max_seq_len=270, embedding_dim=50, n_classes=2, use_bidir=True, use_conv=True, filters=192, kernel_size=3, strides=1, rnn_sizes=[224, 384], rnn_types=['GRU', 'GRU'], mode='regression', **kwargs)[source]

Implements the Smiles2Vec model, which learns neural representations of SMILES strings that can be used for downstream tasks.

The model is based on the description in Goh et al., “SMILES2vec: An Interpretable General-Purpose Deep Neural Network for Predicting Chemical Properties” (https://arxiv.org/pdf/1712.02034.pdf). The goal is to take SMILES strings as inputs and turn them into vector representations that can then be used to predict molecular properties.

The model consists of an Embedding layer that retrieves embeddings for each character in the SMILES string. These embeddings are learnt jointly with the rest of the model. The output from the embedding layer is a tensor of shape (batch_size, seq_len, embedding_dim). This tensor can optionally be fed through a 1D convolutional layer before being passed to a series of RNN cells (optionally bidirectional). The final output from the RNN cells is intended to capture the temporal dependencies in the SMILES string and, in turn, information about the structure of the molecule, which is then used for molecular property prediction.

In the paper, the authors also train an explanation mask to endow the model with interpretability and gain insights into its decision making. That component is not currently part of this implementation, which was developed for the purpose of investigating a transfer learning protocol, ChemNet (described at https://arxiv.org/abs/1712.02734).
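The tensor shapes through the network can be traced with a small helper. This is an illustrative sketch, not DeepChem's code, and it assumes 'valid' convolution padding and that every RNN layer returns full sequences; the real layer configuration may differ.

```python
def smiles2vec_shapes(batch_size, seq_len, embedding_dim=50,
                      use_conv=True, filters=192, kernel_size=3,
                      strides=1, use_bidir=True, rnn_sizes=(224, 384)):
    """Trace the tensor shape after each stage of the model."""
    shapes = [("embedding", (batch_size, seq_len, embedding_dim))]
    if use_conv:
        # A 1D convolution with 'valid' padding shortens the sequence.
        seq_len = (seq_len - kernel_size) // strides + 1
        shapes.append(("conv1d", (batch_size, seq_len, filters)))
    for i, size in enumerate(rnn_sizes):
        # Bidirectional cells concatenate forward and backward states.
        width = 2 * size if use_bidir else size
        shapes.append((f"rnn_{i}", (batch_size, seq_len, width)))
    return shapes
```

With the default hyperparameters and a batch of 32 sequences of length 270, the embedding output is (32, 270, 50), the convolution shortens this to (32, 268, 192), and the last bidirectional layer widens it to (32, 268, 768).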

__init__(char_to_idx, n_tasks=10, max_seq_len=270, embedding_dim=50, n_classes=2, use_bidir=True, use_conv=True, filters=192, kernel_size=3, strides=1, rnn_sizes=[224, 384], rnn_types=['GRU', 'GRU'], mode='regression', **kwargs)[source]
Parameters:
  • char_to_idx (dict) – Mapping from SMILES characters to integer indices.
  • n_tasks (int, default 10) – Number of tasks.
  • max_seq_len (int, default 270) – Maximum length of SMILES sequences.
  • embedding_dim (int, default 50) – Size of the character embeddings used.
  • n_classes (int, default 2) – Number of classes (used in classification mode).
  • use_bidir (bool, default True) – Whether to use bidirectional RNN cells.
  • use_conv (bool, default True) – Whether to use a 1D convolutional layer.
  • filters (int, default 192) – Number of convolutional filters.
  • kernel_size (int, default 3) – Kernel size for convolutions.
  • strides (int, default 1) – Strides used in convolution.
  • rnn_sizes (list[int], default [224, 384]) – Number of hidden units in each RNN cell.
  • rnn_types (list[str], default ['GRU', 'GRU']) – Type of each RNN cell.
  • mode (str, default 'regression') – Whether to use the model for regression or classification.
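Building a char_to_idx mapping and encoding SMILES strings to fixed-length index sequences might look like the sketch below. This is hypothetical pre-processing code for illustration only; DeepChem provides its own featurization utilities, and the choice to reserve index 0 for padding is an assumption.

```python
def build_char_to_idx(smiles_list):
    """Map each distinct character to a stable integer index.

    Index 0 is reserved for padding.
    """
    chars = sorted({c for s in smiles_list for c in s})
    return {c: i + 1 for i, c in enumerate(chars)}

def encode_smiles(smiles, char_to_idx, max_seq_len=270):
    """Encode a SMILES string as indices, padded/truncated to max_seq_len."""
    idx = [char_to_idx[c] for c in smiles[:max_seq_len]]
    return idx + [0] * (max_seq_len - len(idx))
```

A mapping built this way can be passed as the char_to_idx argument when constructing the model, with max_seq_len matching the padding length used here.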