Metrics are one of the most important parts of machine learning. Unlike traditional software, in which algorithms either work or don’t work, machine learning models work in degrees: there is a continuous range of “goodness” for a model. Metrics are functions that measure how well a model works. Many different metrics exist, depending on the type of model at hand.

Metric Utilities

deepchem.metrics.to_one_hot(y, n_classes=2)[source]

Transforms label vector into one-hot encoding.

Turns y into vector of shape (n_samples, n_classes) with a one-hot encoding.

Parameters: y (np.ndarray) – A vector of shape (n_samples, 1)
Return type: A np.ndarray of shape (n_samples, n_classes).
deepchem.metrics.from_one_hot(y, axis=1)[source]

Transforms label vector from one-hot encoding.

Parameters:
  • y (np.ndarray) – A vector of shape (n_samples, num_classes)
  • axis (int, optional (default 1)) – The axis with one-hot encodings to reduce on.

Return type:

A numpy.ndarray of shape (n_samples,)
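Taken together, the two helpers invert each other. A minimal NumPy sketch of the behavior (an illustration, not the DeepChem source):

```python
import numpy as np

def to_one_hot(y, n_classes=2):
    # Map a label vector of shape (n_samples,) or (n_samples, 1)
    # to a one-hot array of shape (n_samples, n_classes).
    return np.eye(n_classes)[np.asarray(y).astype(int).ravel()]

def from_one_hot(y, axis=1):
    # Recover integer labels by taking the argmax along the one-hot axis.
    return np.argmax(y, axis=axis)

labels = np.array([0, 1, 1, 0])
encoded = to_one_hot(labels, n_classes=2)  # shape (4, 2)
decoded = from_one_hot(encoded)            # array([0, 1, 1, 0])
```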

Metric Functions

deepchem.metrics.roc_auc_score(y, y_pred)[source]

Area under the receiver operating characteristic curve.
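ROC AUC is equivalent to the Mann–Whitney U statistic: the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A behavioral sketch of that equivalence in NumPy (no tie handling; not the DeepChem implementation):

```python
import numpy as np

def roc_auc(y, y_pred):
    # Rank all scores, then compare the rank-sum of the positives against
    # the minimum possible rank-sum; normalizing by n_pos * n_neg gives AUC.
    y = np.asarray(y)
    scores = np.asarray(y_pred, dtype=float)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```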

deepchem.metrics.accuracy_score(y, y_pred)[source]

Compute accuracy score.

Computes accuracy score for classification tasks. Works for both binary and multiclass classification.

Parameters:
  • y (np.ndarray) – Of shape (N_samples,)
  • y_pred (np.ndarray) – Of shape (N_samples,)

Returns: score – The fraction of correctly classified samples. A number between 0 and 1.

Return type: float

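The accuracy computation reduces to a single NumPy expression (a sketch of the behavior, not the DeepChem source):

```python
import numpy as np

def accuracy(y, y_pred):
    # Fraction of positions where the predicted label equals the true label.
    # Works for binary and multiclass labels alike.
    return np.mean(np.asarray(y) == np.asarray(y_pred))

accuracy(np.array([0, 1, 2, 1]), np.array([0, 1, 1, 1]))  # 3 of 4 correct -> 0.75
```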
deepchem.metrics.balanced_accuracy_score(y, y_pred)[source]

Computes balanced accuracy score.

deepchem.metrics.pearson_r2_score(y, y_pred)[source]

Computes Pearson R^2 (square of Pearson correlation).
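Pearson R^2 is just the squared correlation coefficient, so any affine relationship between truth and prediction scores a perfect 1.0. A NumPy sketch (not the DeepChem source):

```python
import numpy as np

def pearson_r2(y, y_pred):
    # Square of the Pearson correlation coefficient between truth and prediction.
    return np.corrcoef(y, y_pred)[0, 1] ** 2

y = np.array([1.0, 2.0, 3.0, 4.0])
pearson_r2(y, 2 * y + 1)  # perfectly linear relationship -> 1.0
```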

deepchem.metrics.jaccard_index(y, y_pred)[source]

Computes the Jaccard Index, also known as Intersection over Union (IoU), a metric commonly used in image segmentation tasks.

Parameters:
  • y (np.ndarray) – ground truth array
  • y_pred (np.ndarray) – predicted array
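For binary masks, intersection over union has a direct NumPy expression (a sketch of the idea, not the DeepChem source):

```python
import numpy as np

def jaccard_index(y, y_pred):
    # Intersection over union of two binary masks.
    y = np.asarray(y).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    intersection = np.logical_and(y, y_pred).sum()
    union = np.logical_or(y, y_pred).sum()
    return intersection / union

mask_true = np.array([[1, 1], [0, 0]])
mask_pred = np.array([[1, 0], [0, 0]])
jaccard_index(mask_true, mask_pred)  # 1 pixel overlap / 2 pixel union = 0.5
```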
deepchem.metrics.pixel_error(y, y_pred)[source]

An error metric in case y, y_pred are images.

Defined as 1 - the maximal F-score of pixel similarity, or squared Euclidean distance between the original and the result labels.

Parameters:
  • y (np.ndarray) – ground truth array
  • y_pred (np.ndarray) – predicted array
deepchem.metrics.prc_auc_score(y, y_pred)[source]

Compute area under precision-recall curve

deepchem.metrics.rms_score(y_true, y_pred)[source]

Computes RMS error.

deepchem.metrics.mae_score(y_true, y_pred)[source]

Computes MAE.
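The two regression errors above differ only in how residuals are aggregated: RMS squares them before averaging, MAE averages their absolute values. A NumPy sketch (not the DeepChem source):

```python
import numpy as np

def rms_score(y_true, y_pred):
    # Root-mean-square error: penalizes large residuals quadratically.
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae_score(y_true, y_pred):
    # Mean absolute error: penalizes all residuals linearly.
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 5.0])
rms_score(y_true, y_pred)  # sqrt(4/3)
mae_score(y_true, y_pred)  # 2/3
```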

deepchem.metrics.kappa_score(y_true, y_pred)[source]

Calculate Cohen’s kappa for classification tasks.


Note that this implementation of Cohen’s kappa expects binary labels.

Parameters:
  • y_true (np.ndarray) – Numpy array containing true values.
  • y_pred (np.ndarray) – Numpy array containing predicted values.

Returns: kappa – Numpy array containing kappa for each classification task.

Return type: np.ndarray

Raises: AssertionError – If y_true and y_pred are not the same size, or if class labels are not in [0, 1].
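Cohen's kappa corrects observed agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). A binary-label NumPy sketch of that formula (not the DeepChem source):

```python
import numpy as np

def kappa_score(y_true, y_pred):
    # Cohen's kappa for binary labels in {0, 1}.
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    assert y_true.shape == y_pred.shape
    po = np.mean(y_true == y_pred)  # observed agreement
    pe = sum(np.mean(y_true == c) * np.mean(y_pred == c)
             for c in (0, 1))       # agreement expected by chance
    return (po - pe) / (1 - pe)

kappa_score([0, 1, 1, 0], [0, 1, 1, 0])  # perfect agreement -> 1.0
```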
deepchem.metrics.bedroc_score(y_true, y_pred, alpha=20.0)[source]

BEDROC metric implemented according to Truchon and Bayly, which modifies the ROC score by allowing for a factor of early recognition.

  • y_true (array_like) – Binary class labels. 1 for positive class, 0 otherwise
  • y_pred (array_like) – Predicted labels
  • alpha (float, default 20.0) – Early recognition parameter


Returns: Value in [0, 1] that indicates the degree of early recognition


The original paper by Truchon et al. is located at

deepchem.metrics.genomic_metrics.get_motif_scores(encoded_sequences, motif_names, max_scores=None, return_positions=False, GC_fraction=0.4)[source]

Computes pwm log odds.

Parameters:
  • encoded_sequences (4darray) – (N_sequences, N_letters, sequence_length, 1) array
  • motif_names (list of strings) –
  • max_scores (int, optional) –
  • return_positions (boolean, optional) –
  • GC_fraction (float, optional) –

Returns:
  • (N_sequences, num_motifs, seq_length) complete score array by default.
  • If max_scores, (N_sequences, num_motifs*max_scores) max score array.
  • If max_scores and return_positions, (N_sequences, 2*num_motifs*max_scores) array with max scores and their positions.

deepchem.metrics.genomic_metrics.get_pssm_scores(encoded_sequences, pssm)[source]

Convolves pssm and its reverse complement with encoded sequences and returns the maximum score at each position of each sequence.

Parameters:
  • encoded_sequences (4darray) – (N_sequences, N_letters, sequence_length, 1) array
  • pssm (2darray) – (4, pssm_length) array

Returns: scores – (N_sequences, sequence_length) array

Return type: np.ndarray

deepchem.metrics.genomic_metrics.in_silico_mutagenesis(model, X)[source]

Computes in-silico-mutagenesis scores

Parameters:
  • model (Model) – This can be any model that accepts inputs of the required shape and produces an output of shape (N_sequences, N_tasks).
  • X (ndarray) – Shape (N_sequences, N_letters, sequence_length, 1)

Return type:

(num_task, N_sequences, N_letters, sequence_length, 1) ISM score array.

Metric Class

The dc.metrics.Metric class is a wrapper around metric functions which interoperates with DeepChem dc.models.Model.

class deepchem.metrics.Metric(metric, task_averager=None, name=None, threshold=None, mode=None, compute_energy_metric=False)[source]

Wrapper class for computing user-defined metrics.

There are a variety of different metrics this class aims to support. At its simplest, it handles metrics for classification and regression that compare scalar values. More complex cases, such as comparing two image arrays, are also supported.

The Metric class provides a wrapper for standardizing the API around different classes of metrics that may be useful for DeepChem models. The implementation provides a few non-standard conveniences such as built-in support for multitask and multiclass metrics, and support for multidimensional outputs.

__init__(metric, task_averager=None, name=None, threshold=None, mode=None, compute_energy_metric=False)[source]
Parameters:
  • metric (function) – Function that takes args y_true, y_pred (in that order) and computes the desired score.
  • task_averager (function, optional) – If not None, should be a function that averages metrics across tasks. For example, task_averager=np.mean. If task_averager is provided, this metric is treated as a multitask metric.
  • name (str, optional) – Name of this metric
  • threshold (float, optional) – Used for binary metrics; the threshold for the positive class
  • mode (str, optional) – Must be either classification or regression.
  • compute_energy_metric (TODO(rbharath): Should this be removed?) –
compute_metric(y_true, y_pred, w=None, n_classes=2, filter_nans=True, per_task_metrics=False)[source]

Compute a performance metric for each task.

Parameters:
  • y_true (np.ndarray) – An np.ndarray containing true values for each task.
  • y_pred (np.ndarray) – An np.ndarray containing predicted values for each task.
  • w (np.ndarray, optional) – An np.ndarray containing weights for each datapoint.
  • n_classes (int, optional) – Number of classes in data for classification tasks.
  • filter_nans (bool, optional) – Remove NaN values in computed metrics
  • per_task_metrics (bool, optional) – If true, return computed metric for each task on multitask dataset.

Return type:

A np.ndarray containing metric values for each task.
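The per-task dispatch that compute_metric performs can be sketched as follows. This is a hypothetical mini-implementation illustrating the semantics (each column is a task; the task averager reduces the per-task scores), not DeepChem's actual code:

```python
import numpy as np

def compute_metric(metric, y_true, y_pred, task_averager=np.mean):
    # Apply the wrapped metric to each task (each column of an
    # (n_samples, n_tasks) array), then reduce with the task averager.
    per_task = [metric(y_true[:, t], y_pred[:, t])
                for t in range(y_true.shape[1])]
    return task_averager(per_task)

y_true = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]])
y_pred = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 6.0]])
mae = lambda yt, yp: np.mean(np.abs(yt - yp))
compute_metric(mae, y_true, y_pred)  # mean of per-task MAEs: (0 + 1/3) / 2
```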

compute_singletask_metric(y_true, y_pred, w)[source]

Compute a metric value.

Parameters:
  • y_true (list) – A list of arrays containing true values for each task.
  • y_pred (list) – A list of arrays containing predicted values for each task.

Return type:

Float metric value.


Raises: NotImplementedError – If metric_str is not in METRICS.