Metrics
Metrics are one of the most important parts of machine learning. Unlike traditional software, in which algorithms either work or don’t work, machine learning models work in degrees. That is, there’s a continuous range of “goodness” for a model. “Metrics” are functions which measure how well a model works. There are many different choices of metrics depending on the type of model at hand.
Metric Utilities
Metric utility functions allow for some common manipulations such as switching to/from one-hot representations.

to_one_hot
(y: numpy.ndarray, n_classes: int = 2) → numpy.ndarray
Transforms label vector into one-hot encoding.
Turns y into an array of shape (N, n_classes) with a one-hot encoding. Assumes that y takes values from 0 to n_classes - 1.
 Parameters
y (np.ndarray) – A vector of shape (N,) or (N, 1)
n_classes (int, default 2) – If specified, use this as the number of classes. Otherwise, will try to impute it as n_classes = max(y) + 1 for arrays and as n_classes=2 for the case of scalars. Note this parameter only has value if mode=="classification".
 Returns
A numpy array of shape (N, n_classes).
 Return type
np.ndarray

from_one_hot
(y: numpy.ndarray, axis: int = 1) → numpy.ndarray
Transforms label vector from one-hot encoding.
 Parameters
y (np.ndarray) – A vector of shape (n_samples, num_classes)
axis (int, optional (default 1)) – The axis with one-hot encodings to reduce on.
 Returns
A numpy array of shape (n_samples,)
 Return type
np.ndarray
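The round trip between integer labels and one-hot arrays can be illustrated with a small NumPy sketch. The helper names `one_hot_sketch` and `from_one_hot_sketch` are hypothetical stand-ins written for this example; they mirror the documented shapes but are not DeepChem's implementation.

```python
import numpy as np


def one_hot_sketch(y, n_classes=2):
    """Encode integer labels of shape (N,) or (N, 1) as an (N, n_classes) array."""
    y = np.asarray(y).reshape(-1).astype(int)
    out = np.zeros((y.shape[0], n_classes))
    # Set a single 1 per row, in the column given by the label.
    out[np.arange(y.shape[0]), y] = 1.0
    return out


def from_one_hot_sketch(y, axis=1):
    """Recover integer labels by taking the argmax along the one-hot axis."""
    return np.argmax(y, axis=axis)


labels = np.array([0, 2, 1, 2])
encoded = one_hot_sketch(labels, n_classes=3)   # shape (4, 3)
decoded = from_one_hot_sketch(encoded)          # shape (4,)
```

Decoding with argmax assumes each row contains exactly one nonzero entry, which holds for arrays produced by the encoder above.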
Metric Shape Handling
One of the trickiest parts of handling metrics correctly is making sure the shapes of input weights, predictions and labels are processed correctly. This is challenging in particular since DeepChem supports multitask, multiclass models, which means that shapes must be handled with care to prevent errors. DeepChem maintains the following utility functions which attempt to facilitate shape handling for you.

normalize_weight_shape
(w: numpy.ndarray, n_samples: int, n_tasks: int) → numpy.ndarray
A utility function to correct the shape of the weight array.
This utility function is used to normalize the shapes of a given weight array.
 Parameters
w (np.ndarray) – w can be None or a scalar or a np.ndarray of shape (n_samples,) or of shape (n_samples, n_tasks). If w is a scalar, it’s assumed to be the same weight for all samples/tasks.
n_samples (int) – The number of samples in the dataset. If w is an ndarray, we should have w.shape[0] == n_samples.
n_tasks (int) – The number of tasks. If w is a 2D ndarray, then we should have w.shape[1] == n_tasks.
Examples
>>> import numpy as np
>>> w_out = normalize_weight_shape(None, n_samples=10, n_tasks=1)
>>> (w_out == np.ones((10, 1))).all()
True
 Returns
w_out – Array of shape (n_samples, n_tasks)
 Return type
np.ndarray
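The accepted input forms for w (None, scalar, 1D, 2D) can be sketched as follows. This is an illustrative re-implementation under stated assumptions, not DeepChem's code: the name `normalize_weights_sketch` is hypothetical, and broadcasting a 1D weight vector across all tasks is an assumption about the intended behavior.

```python
import numpy as np


def normalize_weights_sketch(w, n_samples, n_tasks):
    """Return weights as an array of shape (n_samples, n_tasks)."""
    if w is None:
        # No weights given: every sample/task gets weight 1.
        return np.ones((n_samples, n_tasks))
    w = np.asarray(w, dtype=float)
    if w.ndim == 0:
        # Scalar: same weight for all samples and tasks.
        return np.full((n_samples, n_tasks), float(w))
    if w.ndim == 1:
        if w.shape[0] != n_samples:
            raise ValueError("w must have n_samples entries")
        # Assumption: a per-sample weight is repeated for each task.
        return np.tile(w[:, None], (1, n_tasks))
    if w.shape != (n_samples, n_tasks):
        raise ValueError("w must have shape (n_samples, n_tasks)")
    return w


w_none = normalize_weights_sketch(None, n_samples=10, n_tasks=1)
w_scalar = normalize_weights_sketch(0.5, n_samples=3, n_tasks=2)
w_vector = normalize_weights_sketch([1.0, 2.0, 3.0], n_samples=3, n_tasks=2)
```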

normalize_labels_shape
(y: numpy.ndarray, mode: Optional[str] = None, n_tasks: Optional[int] = None, n_classes: Optional[int] = None) → numpy.ndarray
A utility function to correct the shape of the labels.
 Parameters
y (np.ndarray) – y is an array of shape (N,) or (N, n_tasks) or (N, n_tasks, 1).
mode (str, default None) – If mode is “classification” or “regression”, attempts to apply data transformations.
n_tasks (int, default None) – The number of tasks this class is expected to handle.
n_classes (int, default None) – If specified, use this as the number of classes. Otherwise, will try to impute it as n_classes = max(y) + 1 for arrays and as n_classes=2 for the case of scalars. Note this parameter only has value if mode=="classification".
 Returns
y_out – If mode=="classification", y_out is an array of shape (N, n_tasks, n_classes). If mode=="regression", y_out is an array of shape (N, n_tasks).
 Return type
np.ndarray

normalize_prediction_shape
(y: numpy.ndarray, mode: Optional[str] = None, n_tasks: Optional[int] = None, n_classes: Optional[int] = None)
A utility function to correct the shape of provided predictions.
The metric computation classes expect that inputs for classification have the uniform shape (N, n_tasks, n_classes) and inputs for regression have the uniform shape (N, n_tasks). This function normalizes the provided input array to have the desired shape.
Examples
>>> import numpy as np
>>> y = np.random.rand(10)
>>> y_out = normalize_prediction_shape(y, "regression", n_tasks=1)
>>> y_out.shape
(10, 1)
 Parameters
y (np.ndarray) – If mode=="classification", y is an array of shape (N,) or (N, n_tasks) or (N, n_tasks, n_classes). If mode=="regression", y is an array of shape (N,) or (N, n_tasks) or (N, n_tasks, 1).
mode (str, default None) – If mode is “classification” or “regression”, attempts to apply data transformations.
n_tasks (int, default None) – The number of tasks this class is expected to handle.
n_classes (int, default None) – If specified, use this as the number of classes. Otherwise, will try to impute it as n_classes = max(y) + 1 for arrays and as n_classes=2 for the case of scalars. Note this parameter only has value if mode=="classification".
 Returns
y_out – If mode=="classification", y_out is an array of shape (N, n_tasks, n_classes). If mode=="regression", y_out is an array of shape (N, n_tasks).
 Return type
np.ndarray

handle_classification_mode
(y: numpy.ndarray, classification_handling_mode: Optional[str] = None, threshold_value: Optional[float] = None) → numpy.ndarray
Handle classification mode.
Transform predictions so that they have the correct classification mode.
 Parameters
y (np.ndarray) – Must be of shape (N, n_tasks, n_classes)
classification_handling_mode (str, default None) –
DeepChem models by default predict class probabilities for classification problems. This means that for a given singletask prediction, after shape normalization, the DeepChem prediction will be a numpy array of shape (N, n_classes) with class probabilities. classification_handling_mode is a string that instructs this method how to handle transforming these probabilities. It can take on the following values:
- None: default value. Pass y_pred directly into self.metric.
- “threshold”: Use threshold_predictions to threshold y_pred. Use threshold_value as the desired threshold.
- “threshold-one-hot”: Use threshold_predictions to threshold y_pred using threshold_value, then apply to_one_hot to the output.
threshold_value (float, default None) – If set, and classification_handling_mode is “threshold” or “threshold-one-hot”, apply a thresholding operation to values with this threshold. This option is only sensible on binary classification tasks. If float, this will be applied as a binary classification value.
 Returns
y_out – If classification_handling_mode is None, then of shape (N, n_tasks, n_classes). If classification_handling_mode is “threshold”, then of shape (N, n_tasks). If classification_handling_mode is “threshold-one-hot”, then of shape (N, n_tasks, n_classes).
 Return type
np.ndarray
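The effect of the two thresholding modes on output shape can be sketched with NumPy. The function `threshold_sketch` below is a hypothetical illustration for the binary case; it assumes that the probability of class 1 is compared against the threshold, and it does not call DeepChem's threshold_predictions.

```python
import numpy as np


def threshold_sketch(y, threshold=0.5, one_hot=False):
    """Threshold binary class probabilities of shape (N, n_tasks, 2)."""
    y = np.asarray(y)
    # Assumption: a sample is labeled 1 when P(class 1) >= threshold.
    hard = (y[..., 1] >= threshold).astype(int)   # shape (N, n_tasks)
    if not one_hot:
        return hard
    # Re-encode the hard labels as one-hot rows: shape (N, n_tasks, 2).
    return np.eye(y.shape[-1])[hard]


probs = np.array([[[0.9, 0.1]], [[0.3, 0.7]], [[0.5, 0.5]]])  # (3, 1, 2)
hard_labels = threshold_sketch(probs, threshold=0.5)
hard_one_hot = threshold_sketch(probs, threshold=0.5, one_hot=True)
```

The first call corresponds to the “threshold” output shape (N, n_tasks) and the second to the one-hot output shape (N, n_tasks, n_classes).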
Metric Functions
DeepChem has a variety of different metrics which are useful for measuring model performance. A number (but not all) of these metrics are directly sourced from sklearn.

matthews_corrcoef
(y_true, y_pred, *, sample_weight=None)
Compute the Matthews correlation coefficient (MCC)
The Matthews correlation coefficient is used in machine learning as a measure of the quality of binary and multiclass classifications. It takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes. The MCC is in essence a correlation coefficient value between -1 and +1. A coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction. The statistic is also known as the phi coefficient. [source: Wikipedia]
Binary and multiclass labels are supported. Only in the binary case does this relate to information about true and false positives and negatives. See references below.
Read more in the User Guide.
 Parameters
y_true (array, shape = [n_samples]) – Ground truth (correct) target values.
y_pred (array, shape = [n_samples]) – Estimated targets as returned by a classifier.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
New in version 0.18.
 Returns
mcc – The Matthews correlation coefficient (+1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction).
 Return type
float
References
Gorodkin (2004). Comparing two K-category assignments by a K-category correlation coefficient.
Examples
>>> from sklearn.metrics import matthews_corrcoef
>>> y_true = [+1, +1, +1, -1]
>>> y_pred = [+1, -1, +1, +1]
>>> matthews_corrcoef(y_true, y_pred)
-0.33...
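For the binary case, the coefficient can be written directly in terms of the confusion-matrix counts. The following dependency-free sketch uses the standard binary formula; the function name `mcc_binary` is hypothetical, and returning 0.0 when the denominator vanishes is an assumption made for this example.

```python
import math


def mcc_binary(y_true, y_pred, positive=1):
    """Binary Matthews correlation coefficient from confusion counts."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: report 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


score = mcc_binary([1, 1, 1, -1], [1, -1, 1, 1])       # about -0.33
perfect = mcc_binary([1, -1, 1, -1], [1, -1, 1, -1])   # 1.0
```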

recall_score
(y_true, y_pred, *, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')
Compute the recall
The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
The best value is 1 and the worst value is 0.
Read more in the User Guide.
 Parameters
y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) target values.
y_pred (1d array-like, or label indicator array / sparse matrix) – Estimated targets as returned by a classifier.
labels (list, optional) – The set of labels to include when average != 'binary', and their order if average is None. Labels present in the data can be excluded, for example to calculate a multiclass average ignoring a majority negative class, while labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in y_true and y_pred are used in sorted order.
Changed in version 0.17: parameter labels improved for multiclass problem.
pos_label (str or int, 1 by default) – The class to report if average='binary' and the data is binary. If the data are multiclass or multilabel, this will be ignored; setting labels=[pos_label] and average != 'binary' will report scores for that label only.
average (string, [None, 'binary' (default), 'micro', 'macro', 'samples', 'weighted']) – This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:
'binary': Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.
'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
'weighted': Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.
'samples': Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from accuracy_score()).
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
zero_division ("warn", 0 or 1, default="warn") – Sets the value to return when there is a zero division. If set to “warn”, this acts as 0, but warnings are also raised.
 Returns
recall – Recall of the positive class in binary classification or weighted average of the recall of each class for the multiclass task.
 Return type
float (if average is not None) or array of float, shape = [n_unique_labels]
See also
precision_recall_fscore_support, balanced_accuracy_score, multilabel_confusion_matrix
Examples
>>> from sklearn.metrics import recall_score
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> recall_score(y_true, y_pred, average='macro')
0.33...
>>> recall_score(y_true, y_pred, average='micro')
0.33...
>>> recall_score(y_true, y_pred, average='weighted')
0.33...
>>> recall_score(y_true, y_pred, average=None)
array([1., 0., 0.])
>>> y_true = [0, 0, 0, 0, 0, 0]
>>> recall_score(y_true, y_pred, average=None)
array([0.5, 0. , 0. ])
>>> recall_score(y_true, y_pred, average=None, zero_division=1)
array([0.5, 1. , 1. ])
Notes
When true positive + false negative == 0, recall returns 0 and raises UndefinedMetricWarning. This behavior can be modified with zero_division.
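The ratio tp / (tp + fn) and the 'macro' average can be traced by hand with a short dependency-free sketch. The helper `recall_per_class` is a hypothetical name used only for illustration; it returns 0.0 for a class with no true instances, which mirrors the default described in the note above but omits the warning.

```python
def recall_per_class(y_true, y_pred, labels):
    """Recall tp / (tp + fn) for each label, in the order given."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        # Assumption: undefined recall (no true instances) is reported as 0.0.
        scores.append(tp / (tp + fn) if (tp + fn) else 0.0)
    return scores


y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
per_class = recall_per_class(y_true, y_pred, labels=[0, 1, 2])  # [1.0, 0.0, 0.0]
macro = sum(per_class) / len(per_class)                          # about 0.33
```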

r2_score
(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average')
R^2 (coefficient of determination) regression score function.
Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Read more in the User Guide.
 Parameters
y_true (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Ground truth (correct) target values.
y_pred (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Estimated target values.
sample_weight (array-like of shape (n_samples,), optional) – Sample weights.
multioutput (string in ['raw_values', 'uniform_average', 'variance_weighted'] or None or array-like of shape (n_outputs)) – Defines aggregating of multiple output scores. Array-like value defines weights used to average scores. Default is “uniform_average”.
'raw_values': Returns a full set of scores in case of multioutput input.
'uniform_average': Scores of all outputs are averaged with uniform weight.
'variance_weighted': Scores of all outputs are averaged, weighted by the variances of each individual output.
Changed in version 0.19: Default value of multioutput is ‘uniform_average’.
 Returns
z – The R^2 score or ndarray of scores if ‘multioutput’ is ‘raw_values’.
 Return type
float or ndarray of floats
Notes
This is not a symmetric function.
Unlike most other scores, R^2 score may be negative (it need not actually be the square of a quantity R).
This metric is not well-defined for single samples and will return a NaN value if n_samples is less than two.
Examples
>>> from sklearn.metrics import r2_score
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> r2_score(y_true, y_pred)
0.948...
>>> y_true = [[0.5, 1], [-1, 1], [7, -6]]
>>> y_pred = [[0, 2], [-1, 2], [8, -5]]
>>> r2_score(y_true, y_pred,
...          multioutput='variance_weighted')
0.938...
>>> y_true = [1, 2, 3]
>>> y_pred = [1, 2, 3]
>>> r2_score(y_true, y_pred)
1.0
>>> y_true = [1, 2, 3]
>>> y_pred = [2, 2, 2]
>>> r2_score(y_true, y_pred)
0.0
>>> y_true = [1, 2, 3]
>>> y_pred = [3, 2, 1]
>>> r2_score(y_true, y_pred)
-3.0
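The score follows the usual definition R^2 = 1 - SS_res / SS_tot, which explains why a constant mean predictor scores 0.0 and why worse predictors go negative. A minimal single-output sketch, using a hypothetical helper `r2_sketch` and assuming y_true is not constant (so SS_tot is nonzero):

```python
def r2_sketch(y_true, y_pred):
    """Single-output R^2 = 1 - SS_res / SS_tot (assumes y_true is not constant)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


perfect = r2_sketch([1, 2, 3], [1, 2, 3])        # 1.0
mean_only = r2_sketch([1, 2, 3], [2, 2, 2])      # 0.0
reversed_fit = r2_sketch([1, 2, 3], [3, 2, 1])   # -3.0
```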

mean_squared_error
(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average', squared=True)
Mean squared error regression loss
Read more in the User Guide.
 Parameters
y_true (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Ground truth (correct) target values.
y_pred (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Estimated target values.
sample_weight (array-like of shape (n_samples,), optional) – Sample weights.
multioutput (string in ['raw_values', 'uniform_average'] or array-like of shape (n_outputs)) – Defines aggregating of multiple output values. Array-like value defines weights used to average errors.
'raw_values': Returns a full set of errors in case of multioutput input.
'uniform_average': Errors of all outputs are averaged with uniform weight.
squared (boolean value, optional (default = True)) – If True returns MSE value, if False returns RMSE value.
 Returns
loss – A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.
 Return type
float or ndarray of floats
Examples
>>> from sklearn.metrics import mean_squared_error
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> mean_squared_error(y_true, y_pred)
0.375
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> mean_squared_error(y_true, y_pred, squared=False)
0.612...
>>> y_true = [[0.5, 1],[-1, 1],[7, -6]]
>>> y_pred = [[0, 2],[-1, 2],[8, -5]]
>>> mean_squared_error(y_true, y_pred)
0.708...
>>> mean_squared_error(y_true, y_pred, squared=False)
0.822...
>>> mean_squared_error(y_true, y_pred, multioutput='raw_values')
array([0.41666667, 1. ])
>>> mean_squared_error(y_true, y_pred, multioutput=[0.3, 0.7])
0.825...
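The relationship between the MSE and RMSE values selected by the squared flag can be checked by hand for the single-output case. This is a plain-Python sketch with a hypothetical helper name, not the library routine:

```python
import math


def mse_sketch(y_true, y_pred, squared=True):
    """Mean squared error; returns the root (RMSE) when squared is False."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return mse if squared else math.sqrt(mse)


y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
mse = mse_sketch(y_true, y_pred)                   # 0.375
rmse = mse_sketch(y_true, y_pred, squared=False)   # about 0.612
```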

mean_absolute_error
(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average')
Mean absolute error regression loss
Read more in the User Guide.
 Parameters
y_true (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Ground truth (correct) target values.
y_pred (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Estimated target values.
sample_weight (array-like of shape (n_samples,), optional) – Sample weights.
multioutput (string in ['raw_values', 'uniform_average'] or array-like of shape (n_outputs)) – Defines aggregating of multiple output values. Array-like value defines weights used to average errors.
'raw_values': Returns a full set of errors in case of multioutput input.
'uniform_average': Errors of all outputs are averaged with uniform weight.
 Returns
loss – If multioutput is ‘raw_values’, then mean absolute error is returned for each output separately. If multioutput is ‘uniform_average’ or an ndarray of weights, then the weighted average of all output errors is returned.
MAE output is a non-negative floating point value. The best value is 0.0.
 Return type
float or ndarray of floats
Examples
>>> from sklearn.metrics import mean_absolute_error
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> mean_absolute_error(y_true, y_pred)
0.5
>>> y_true = [[0.5, 1], [-1, 1], [7, -6]]
>>> y_pred = [[0, 2], [-1, 2], [8, -5]]
>>> mean_absolute_error(y_true, y_pred)
0.75
>>> mean_absolute_error(y_true, y_pred, multioutput='raw_values')
array([0.5, 1. ])
>>> mean_absolute_error(y_true, y_pred, multioutput=[0.3, 0.7])
0.85...
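How the multioutput options combine per-output errors can be seen in a small plain-Python sketch. The helpers `mae_per_output` and `weighted_average` are hypothetical names for this illustration; normalizing by the sum of the weights is an assumption that makes no difference here because the weights sum to 1.

```python
def mae_per_output(y_true, y_pred):
    """Mean absolute error of each output column ('raw_values')."""
    n_samples = len(y_true)
    n_outputs = len(y_true[0])
    return [
        sum(abs(y_true[i][j] - y_pred[i][j]) for i in range(n_samples)) / n_samples
        for j in range(n_outputs)
    ]


def weighted_average(values, weights):
    """Average of per-output errors under the given output weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]
raw = mae_per_output(y_true, y_pred)           # [0.5, 1.0]
uniform = weighted_average(raw, [1, 1])        # 0.75
weighted = weighted_average(raw, [0.3, 0.7])   # about 0.85
```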

precision_score
(y_true, y_pred, *, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')
Compute the precision
The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
The best value is 1 and the worst value is 0.
Read more in the User Guide.
 Parameters
y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) target values.
y_pred (1d array-like, or label indicator array / sparse matrix) – Estimated targets as returned by a classifier.
labels (list, optional) – The set of labels to include when average != 'binary', and their order if average is None. Labels present in the data can be excluded, for example to calculate a multiclass average ignoring a majority negative class, while labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in y_true and y_pred are used in sorted order.
Changed in version 0.17: parameter labels improved for multiclass problem.
pos_label (str or int, 1 by default) – The class to report if average='binary' and the data is binary. If the data are multiclass or multilabel, this will be ignored; setting labels=[pos_label] and average != 'binary' will report scores for that label only.
average (string, [None, 'binary' (default), 'micro', 'macro', 'samples', 'weighted']) – This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:
'binary': Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.
'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
'weighted': Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.
'samples': Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from accuracy_score()).
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
zero_division ("warn", 0 or 1, default="warn") – Sets the value to return when there is a zero division. If set to “warn”, this acts as 0, but warnings are also raised.
 Returns
precision – Precision of the positive class in binary classification or weighted average of the precision of each class for the multiclass task.
 Return type
float (if average is not None) or array of float, shape = [n_unique_labels]
See also
precision_recall_fscore_support, multilabel_confusion_matrix
Examples
>>> from sklearn.metrics import precision_score
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> precision_score(y_true, y_pred, average='macro')
0.22...
>>> precision_score(y_true, y_pred, average='micro')
0.33...
>>> precision_score(y_true, y_pred, average='weighted')
0.22...
>>> precision_score(y_true, y_pred, average=None)
array([0.66..., 0. , 0. ])
>>> y_pred = [0, 0, 0, 0, 0, 0]
>>> precision_score(y_true, y_pred, average=None)
array([0.33..., 0. , 0. ])
>>> precision_score(y_true, y_pred, average=None, zero_division=1)
array([0.33..., 1. , 1. ])
Notes
When true positive + false positive == 0, precision returns 0 and raises UndefinedMetricWarning. This behavior can be modified with zero_division.
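The role of zero_division becomes clearer when the per-class ratio tp / (tp + fp) is computed by hand. In the sketch below, `precision_per_class` is a hypothetical helper and its zero_division argument is a plain number substituted whenever a class is never predicted; no warning is raised, unlike the behavior described in the note above.

```python
def precision_per_class(y_true, y_pred, labels, zero_division=0.0):
    """Precision tp / (tp + fp) for each label, in the order given."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if p == label and t == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if p == label and t != label)
        # A class that is never predicted has an undefined precision.
        scores.append(tp / (tp + fp) if (tp + fp) else zero_division)
    return scores


y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
per_class = precision_per_class(y_true, y_pred, [0, 1, 2])   # [0.66..., 0.0, 0.0]
macro = sum(per_class) / len(per_class)                      # about 0.22

all_zero_pred = [0, 0, 0, 0, 0, 0]
filled = precision_per_class(y_true, all_zero_pred, [0, 1, 2], zero_division=1.0)
```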

precision_recall_curve
(y_true, probas_pred, *, pos_label=None, sample_weight=None)
Compute precision-recall pairs for different probability thresholds
Note: this implementation is restricted to the binary classification task.
The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
The last precision and recall values are 1. and 0. respectively and do not have a corresponding threshold. This ensures that the graph starts on the y axis.
Read more in the User Guide.
 Parameters
y_true (array, shape = [n_samples]) – True binary labels. If labels are not either {-1, 1} or {0, 1}, then pos_label should be explicitly given.
probas_pred (array, shape = [n_samples]) – Estimated probabilities or decision function.
pos_label (int or str, default=None) – The label of the positive class. When pos_label=None, if y_true is in {-1, 1} or {0, 1}, pos_label is set to 1, otherwise an error will be raised.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
 Returns
precision (array, shape = [n_thresholds + 1]) – Precision values such that element i is the precision of predictions with score >= thresholds[i] and the last element is 1.
recall (array, shape = [n_thresholds + 1]) – Decreasing recall values such that element i is the recall of predictions with score >= thresholds[i] and the last element is 0.
thresholds (array, shape = [n_thresholds <= len(np.unique(probas_pred))]) – Increasing thresholds on the decision function used to compute precision and recall.
See also
average_precision_score
Compute average precision from prediction scores
roc_curve
Compute Receiver operating characteristic (ROC) curve
Examples
>>> import numpy as np
>>> from sklearn.metrics import precision_recall_curve
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> precision, recall, thresholds = precision_recall_curve(
...     y_true, y_scores)
>>> precision
array([0.66666667, 0.5 , 1. , 1. ])
>>> recall
array([1. , 0.5, 0.5, 0. ])
>>> thresholds
array([0.35, 0.4 , 0.8 ])
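Each point on the curve is simply the precision and recall obtained when every sample with score >= threshold is predicted positive. The sketch below recomputes the three thresholded pairs from the example by hand; `precision_recall_at` is a hypothetical helper, and treating precision as 1.0 when nothing is predicted positive is an assumption made for this illustration.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t != 1)
    fn = sum(1 for t, s in zip(y_true, scores) if s < threshold and t == 1)
    # Assumption: precision is 1.0 when no sample is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]
pairs = [precision_recall_at(y_true, y_scores, t) for t in (0.35, 0.4, 0.8)]
```

The resulting pairs line up with the first three entries of the precision and recall arrays in the example; the final (1., 0.) point has no threshold and is appended by convention.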

auc
(x, y)
Compute Area Under the Curve (AUC) using the trapezoidal rule
This is a general function, given points on a curve. For computing the area under the ROC-curve, see roc_auc_score(). For an alternative way to summarize a precision-recall curve, see average_precision_score().
 Parameters
x (array, shape = [n]) – x coordinates. These must be either monotonic increasing or monotonic decreasing.
y (array, shape = [n]) – y coordinates.
 Returns
auc
 Return type
float
Examples
>>> import numpy as np
>>> from sklearn import metrics
>>> y = np.array([1, 1, 2, 2])
>>> pred = np.array([0.1, 0.4, 0.35, 0.8])
>>> fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)
>>> metrics.auc(fpr, tpr)
0.75
See also
roc_auc_score
Compute the area under the ROC curve
average_precision_score
Compute average precision from prediction scores
precision_recall_curve
Compute precision-recall pairs for different probability thresholds
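The trapezoidal rule itself is short enough to write out. The sketch below applies it to ROC points worked out by hand for the example data (labels [1, 1, 2, 2], scores [0.1, 0.4, 0.35, 0.8], positive label 2); the helper name `trapezoid_area` is hypothetical, and the function assumes x is monotonically increasing.

```python
def trapezoid_area(x, y):
    """Area under the piecewise-linear curve through (x[i], y[i]).

    Assumes x is monotonically increasing.
    """
    area = 0.0
    for i in range(1, len(x)):
        # Width of the interval times the mean height of its two endpoints.
        area += (x[i] - x[i - 1]) * (y[i] + y[i - 1]) / 2.0
    return area


# ROC points derived by hand from the example data.
fpr = [0.0, 0.0, 0.5, 0.5, 1.0]
tpr = [0.0, 0.5, 0.5, 1.0, 1.0]
roc_area = trapezoid_area(fpr, tpr)   # 0.75
```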

jaccard_score
(y_true, y_pred, *, labels=None, pos_label=1, average='binary', sample_weight=None)
Jaccard similarity coefficient score
The Jaccard index [1], or Jaccard similarity coefficient, defined as the size of the intersection divided by the size of the union of two label sets, is used to compare the set of predicted labels for a sample to the corresponding set of labels in y_true.
Read more in the User Guide.
 Parameters
y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) labels.
y_pred (1d array-like, or label indicator array / sparse matrix) – Predicted labels, as returned by a classifier.
labels (list, optional) – The set of labels to include when average != 'binary', and their order if average is None. Labels present in the data can be excluded, for example to calculate a multiclass average ignoring a majority negative class, while labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in y_true and y_pred are used in sorted order.
pos_label (str or int, 1 by default) – The class to report if average='binary' and the data is binary. If the data are multiclass or multilabel, this will be ignored; setting labels=[pos_label] and average != 'binary' will report scores for that label only.
average (string, [None, 'binary' (default), 'micro', 'macro', 'samples', 'weighted']) – If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:
'binary': Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.
'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
'weighted': Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance.
'samples': Calculate metrics for each instance, and find their average (only meaningful for multilabel classification).
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
 Returns
score
 Return type
float (if average is not None) or array of floats, shape = [n_unique_labels]
See also
accuracy_score, f_score, multilabel_confusion_matrix
Notes
jaccard_score() may be a poor metric if there are no positives for some samples or classes. Jaccard is undefined if there are no true or predicted labels, and our implementation will return a score of 0 with a warning.
References
Examples
>>> import numpy as np
>>> from sklearn.metrics import jaccard_score
>>> y_true = np.array([[0, 1, 1],
...                    [1, 1, 0]])
>>> y_pred = np.array([[1, 1, 1],
...                    [1, 0, 0]])
In the binary case:
>>> jaccard_score(y_true[0], y_pred[0])
0.6666...
In the multilabel case:
>>> jaccard_score(y_true, y_pred, average='samples')
0.5833...
>>> jaccard_score(y_true, y_pred, average='macro')
0.6666...
>>> jaccard_score(y_true, y_pred, average=None)
array([0.5, 0.5, 1. ])
In the multiclass case:
>>> y_pred = [0, 2, 1, 2]
>>> y_true = [0, 1, 2, 2]
>>> jaccard_score(y_true, y_pred, average=None)
array([1. , 0. , 0.33...])
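The intersection-over-union definition can be verified by hand for the binary and 'samples' cases. The helper `jaccard_binary` below is a hypothetical, dependency-free sketch; it returns 0.0 when both label sets are empty, matching the documented fallback but without the warning.

```python
def jaccard_binary(y_true, y_pred, pos_label=1):
    """Size of the intersection over size of the union of positive labels."""
    inter = sum(1 for t, p in zip(y_true, y_pred) if t == pos_label and p == pos_label)
    union = sum(1 for t, p in zip(y_true, y_pred) if t == pos_label or p == pos_label)
    # Assumption: an empty union (no true or predicted positives) scores 0.0.
    return inter / union if union else 0.0


y_true = [[0, 1, 1], [1, 1, 0]]
y_pred = [[1, 1, 1], [1, 0, 0]]
binary = jaccard_binary(y_true[0], y_pred[0])                     # about 0.6666
per_sample = [jaccard_binary(t, p) for t, p in zip(y_true, y_pred)]
samples_avg = sum(per_sample) / len(per_sample)                   # about 0.5833
```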

f1_score
(y_true, y_pred, *, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')
Compute the F1 score, also known as balanced F-score or F-measure
The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. The formula for the F1 score is:
F1 = 2 * (precision * recall) / (precision + recall)
In the multiclass and multilabel case, this is the average of the F1 score of each class with weighting depending on the average parameter.
Read more in the User Guide.
 Parameters
y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) target values.
y_pred (1d array-like, or label indicator array / sparse matrix) – Estimated targets as returned by a classifier.
labels (list, optional) – The set of labels to include when average != 'binary', and their order if average is None. Labels present in the data can be excluded, for example to calculate a multiclass average ignoring a majority negative class, while labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in y_true and y_pred are used in sorted order.
Changed in version 0.17: parameter labels improved for multiclass problem.
pos_label (str or int, 1 by default) – The class to report if average='binary' and the data is binary. If the data are multiclass or multilabel, this will be ignored; setting labels=[pos_label] and average != 'binary' will report scores for that label only.
average (string, [None, 'binary' (default), 'micro', 'macro', 'samples', 'weighted']) – This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:
'binary': Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.
'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
'weighted': Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.
'samples': Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from accuracy_score()).
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
zero_division ("warn", 0 or 1, default="warn") – Sets the value to return when there is a zero division, i.e. when all predictions and labels are negative. If set to “warn”, this acts as 0, but warnings are also raised.
 Returns
f1_score – F1 score of the positive class in binary classification or weighted average of the F1 scores of each class for the multiclass task.
 Return type
float or array of float, shape = [n_unique_labels]
See also
fbeta_score, precision_recall_fscore_support, jaccard_score, multilabel_confusion_matrix
Examples
>>> from sklearn.metrics import f1_score
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> f1_score(y_true, y_pred, average='macro')
0.26...
>>> f1_score(y_true, y_pred, average='micro')
0.33...
>>> f1_score(y_true, y_pred, average='weighted')
0.26...
>>> f1_score(y_true, y_pred, average=None)
array([0.8, 0. , 0. ])
>>> y_true = [0, 0, 0, 0, 0, 0]
>>> y_pred = [0, 0, 0, 0, 0, 0]
>>> f1_score(y_true, y_pred, zero_division=1)
1.0...
Notes
When true positive + false positive == 0, precision is undefined; when true positive + false negative == 0, recall is undefined. In such cases, by default the metric will be set to 0, as will the F-score, and UndefinedMetricWarning will be raised. This behavior can be modified with zero_division.
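The averaging modes described above can be reproduced by hand for the multiclass example. The following is a minimal sketch in plain Python, not scikit-learn's implementation; the helper names per_class_counts and f1_from_counts are hypothetical, and the zero-division handling is simplified to a single fallback value.

```python
from collections import Counter


def per_class_counts(y_true, y_pred, labels):
    """Return {label: (tp, fp, fn)}, treating each label as one-vs-rest."""
    counts = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        counts[label] = (tp, fp, fn)
    return counts


def f1_from_counts(tp, fp, fn, zero_division=0.0):
    """F1 = 2*TP / (2*TP + FP + FN); simplified fallback when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else zero_division


y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
labels = sorted(set(y_true) | set(y_pred))
counts = per_class_counts(y_true, y_pred, labels)

# average=None: one score per class
per_class_f1 = [f1_from_counts(*counts[label]) for label in labels]

# 'macro': unweighted mean of the per-class scores
macro_f1 = sum(per_class_f1) / len(labels)

# 'weighted': mean of the per-class scores weighted by support
support = Counter(y_true)
weighted_f1 = sum(
    f1 * support[label] for f1, label in zip(per_class_f1, labels)
) / len(y_true)

# 'micro': pool TP, FP and FN over all classes, then compute one score
micro_f1 = f1_from_counts(*(sum(c[i] for c in counts.values()) for i in range(3)))

print(per_class_f1, macro_f1, weighted_f1, micro_f1)
```

Only class 0 is ever predicted correctly here, so its score carries the macro and weighted averages, while the micro average pools the counts of all three classes.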

roc_auc_score
(y_true, y_score, *, average='macro', sample_weight=None, max_fpr=None, multi_class='raise', labels=None)[source]¶ Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.
Note: this implementation can be used with binary, multiclass and multilabel classification, but some restrictions apply (see Parameters).
Read more in the User Guide.
 Parameters
y_true (array-like of shape (n_samples,) or (n_samples, n_classes)) – True labels or binary label indicators. The binary and multiclass cases expect labels with shape (n_samples,) while the multilabel case expects binary label indicators with shape (n_samples, n_classes).
y_score (array-like of shape (n_samples,) or (n_samples, n_classes)) – Target scores. In the binary and multilabel cases, these can be either probability estimates or non-thresholded decision values (as returned by decision_function on some classifiers). In the multiclass case, these must be probability estimates which sum to 1. The binary case expects a shape (n_samples,), and the scores must be the scores of the class with the greater label. The multiclass and multilabel cases expect a shape (n_samples, n_classes). In the multiclass case, the order of the class scores must correspond to the order of labels, if provided, or else to the numerical or lexicographical order of the labels in y_true.
average ({'micro', 'macro', 'samples', 'weighted'} or None, default='macro') –
If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data. Note: multiclass ROC AUC currently only handles the ‘macro’ and ‘weighted’ averages.
'micro': Calculate metrics globally by considering each element of the label indicator matrix as a label.
'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
'weighted': Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label).
'samples': Calculate metrics for each instance, and find their average.
Will be ignored when y_true is binary.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
max_fpr (float > 0 and <= 1, default=None) – If not None, the standardized partial AUC [2]_ over the range [0, max_fpr] is returned. For the multiclass case, max_fpr should be either equal to None or 1.0, as partial ROC AUC computation is currently not supported for multiclass.
multi_class ({'raise', 'ovr', 'ovo'}, default='raise') –
Multiclass only. Determines the type of configuration to use. The default value raises an error, so either 'ovr' or 'ovo' must be passed explicitly.
'ovr': Computes the AUC of each class against the rest [3]_ [4]_. This treats the multiclass case in the same way as the multilabel case. Sensitive to class imbalance even when average == 'macro', because class imbalance affects the composition of each of the ‘rest’ groupings.
'ovo': Computes the average AUC of all possible pairwise combinations of classes [5]_. Insensitive to class imbalance when average == 'macro'.
labels (array-like of shape (n_classes,), default=None) – Multiclass only. List of labels that index the classes in y_score. If None, the numerical or lexicographical order of the labels in y_true is used.
 Returns
auc
 Return type
float
References
 1
 2
 3
Provost, F., Domingos, P. (2000). Well-trained PETs: Improving probability estimation trees (Section 6.2), CeDER Working Paper #IS-00-04, Stern School of Business, New York University.
 4
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861-874.
 5
See also
average_precision_score
Area under the precision-recall curve
roc_curve
Compute Receiver operating characteristic (ROC) curve
Examples
>>> import numpy as np
>>> from sklearn.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> roc_auc_score(y_true, y_scores)
0.75
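For the binary case, the ROC AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below uses that pairwise view to recover the value in the example above. It is an O(n_pos * n_neg) illustration with a hypothetical helper name (pairwise_auc), not how scikit-learn computes the score, and it ignores sample_weight and max_fpr.

```python
def pairwise_auc(y_true, y_score):
    """Binary ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined when only one class is present")
    # A correctly ordered pair counts 1, a tie counts 0.5, a misordered pair counts 0.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```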

accuracy_score
(y_true, y_pred, *, normalize=True, sample_weight=None)[source]¶ Accuracy classification score.
In multilabel classification, this function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.
Read more in the User Guide.
 Parameters
y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) labels.
y_pred (1d array-like, or label indicator array / sparse matrix) – Predicted labels, as returned by a classifier.
normalize (bool, optional (default=True)) – If False, return the number of correctly classified samples. Otherwise, return the fraction of correctly classified samples.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
 Returns
score – If normalize == True, return the fraction of correctly classified samples (float), else return the number of correctly classified samples (int).
The best performance is 1 with normalize == True and the number of samples with normalize == False.
 Return type
float
See also
jaccard_score, hamming_loss, zero_one_loss
Notes
In binary and multiclass classification, this function is equal to the jaccard_score function.
Examples
>>> from sklearn.metrics import accuracy_score
>>> y_pred = [0, 2, 1, 3]
>>> y_true = [0, 1, 2, 3]
>>> accuracy_score(y_true, y_pred)
0.5
>>> accuracy_score(y_true, y_pred, normalize=False)
2
In the multilabel case with binary label indicators:
>>> import numpy as np
>>> accuracy_score(np.array([[0, 1], [1, 1]]), np.ones((2, 2)))
0.5
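The subset accuracy used in the multilabel case can be sketched in a few lines: a sample counts as correct only when its whole label row matches. This is an illustration in plain Python with a hypothetical helper name (subset_accuracy); it omits sample_weight.

```python
def subset_accuracy(y_true, y_pred, normalize=True):
    """Count samples whose full label row matches exactly."""
    matches = sum(list(t) == list(p) for t, p in zip(y_true, y_pred))
    return matches / len(y_true) if normalize else matches


y_true = [[0, 1], [1, 1]]
y_pred = [[1, 1], [1, 1]]

score = subset_accuracy(y_true, y_pred)                   # fraction of exact matches
count = subset_accuracy(y_true, y_pred, normalize=False)  # number of exact matches
print(score, count)
```

The first row differs in one label, so it counts as a miss even though half of its labels are right.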

balanced_accuracy_score
(y_true, y_pred, *, sample_weight=None, adjusted=False)[source]¶ Compute the balanced accuracy
The balanced accuracy is used in binary and multiclass classification problems to deal with imbalanced datasets. It is defined as the average of recall obtained on each class.
The best value is 1 and the worst value is 0 when adjusted=False.
Read more in the User Guide.
New in version 0.20.
 Parameters
y_true (1d array-like) – Ground truth (correct) target values.
y_pred (1d array-like) – Estimated targets as returned by a classifier.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
adjusted (bool, default=False) – When true, the result is adjusted for chance, so that random performance would score 0, and perfect performance scores 1.
 Returns
balanced_accuracy
 Return type
float
Notes
Some literature promotes alternative definitions of balanced accuracy. Our definition is equivalent to accuracy_score() with class-balanced sample weights, and shares desirable properties with the binary case. See the User Guide.
References
 1
Brodersen, K.H.; Ong, C.S.; Stephan, K.E.; Buhmann, J.M. (2010). The balanced accuracy and its posterior distribution. Proceedings of the 20th International Conference on Pattern Recognition, 3121-3124.
 2
John D. Kelleher, Brian Mac Namee, Aoife D’Arcy (2015). Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies.
Examples
>>> from sklearn.metrics import balanced_accuracy_score
>>> y_true = [0, 1, 0, 0, 1, 0]
>>> y_pred = [0, 1, 0, 0, 0, 1]
>>> balanced_accuracy_score(y_true, y_pred)
0.625
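Since balanced accuracy is defined as the average of per-class recall, the example value can be checked by hand. The sketch below is a plain-Python illustration with a hypothetical helper name (balanced_accuracy). It ignores sample_weight, and its adjusted branch rescales so that the chance level 1 / n_classes maps to 0 and a perfect score maps to 1.

```python
def balanced_accuracy(y_true, y_pred, adjusted=False):
    """Mean of per-class recall, optionally rescaled so chance level is 0."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    score = sum(recalls) / len(recalls)
    if adjusted:
        chance = 1 / len(classes)
        score = (score - chance) / (1 - chance)
    return score


y_true = [0, 1, 0, 0, 1, 0]
y_pred = [0, 1, 0, 0, 0, 1]

plain = balanced_accuracy(y_true, y_pred)               # (3/4 + 1/2) / 2
rescaled = balanced_accuracy(y_true, y_pred, adjusted=True)
print(plain, rescaled)
```

Class 0 has recall 3/4 and class 1 has recall 1/2, which gives 0.625; the chance-adjusted value is (0.625 - 0.5) / 0.5 = 0.25.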

pearson_r2_score
(y: numpy.ndarray, y_pred: numpy.ndarray) → float[source]¶ Computes Pearson R^2 (square of Pearson correlation).
 Parameters
y (np.ndarray) – ground truth array
y_pred (np.ndarray) – predicted array
 Returns
The Pearson R^2 score.
 Return type
float
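As a rough illustration of what "square of Pearson correlation" means, the following plain-Python sketch computes it from first principles. The helper name pearson_r2 is hypothetical and this is not DeepChem's implementation. Note that the squared correlation is insensitive to the scale and sign of the predictions, so it measures linear association rather than closeness to the true values.

```python
def pearson_r2(y, y_pred):
    """Square of the Pearson correlation coefficient between y and y_pred."""
    n = len(y)
    mean_y = sum(y) / n
    mean_p = sum(y_pred) / n
    cov = sum((a - mean_y) * (b - mean_p) for a, b in zip(y, y_pred))
    var_y = sum((a - mean_y) ** 2 for a in y)
    var_p = sum((b - mean_p) ** 2 for b in y_pred)
    return cov ** 2 / (var_y * var_p)


y = [1.0, 2.0, 3.0, 4.0]

r2_scaled = pearson_r2(y, [3.0, 5.0, 7.0, 9.0])    # linear in y, wrong scale
r2_flipped = pearson_r2(y, [4.0, 3.0, 2.0, 1.0])   # perfectly anti-correlated
r2_noisy = pearson_r2(y, [1.0, 2.5, 2.5, 4.0])     # imperfect predictions
print(r2_scaled, r2_flipped, r2_noisy)
```

Both the rescaled and the reversed predictions score 1, while the noisy predictions score 0.9.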

jaccard_index
(y: numpy.ndarray, y_pred: numpy.ndarray) → float[source]¶ Computes the Jaccard index, also known as Intersection over Union, a metric commonly used in image segmentation tasks.
DEPRECATED: WILL BE REMOVED IN A FUTURE VERSION OF DEEPCHEM. USE jaccard_score instead.
 Parameters
y (np.ndarray) – ground truth array
y_pred (np.ndarray) – predicted array
 Returns
score – The jaccard index. A number between 0 and 1.
 Return type
float
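For binary masks, Intersection over Union can be sketched as below. This is a minimal plain-Python illustration on flattened 0/1 masks with a hypothetical helper name (intersection_over_union); returning 0.0 for two empty masks is a convention chosen for this sketch, not documented behavior.

```python
def intersection_over_union(y, y_pred):
    """IoU of two flattened binary masks (1 marks the foreground)."""
    intersection = sum(1 for a, b in zip(y, y_pred) if a == 1 and b == 1)
    union = sum(1 for a, b in zip(y, y_pred) if a == 1 or b == 1)
    # Assumption for this sketch: two empty masks score 0.0.
    return intersection / union if union else 0.0


mask_true = [1, 1, 0, 0, 1, 0]
mask_pred = [1, 0, 0, 1, 1, 0]

iou = intersection_over_union(mask_true, mask_pred)
print(iou)  # 2 shared foreground pixels out of 4 in the union
```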

pixel_error
(y: numpy.ndarray, y_pred: numpy.ndarray) → float[source]¶ An error metric in case y, y_pred are images.
Defined as 1 - the maximal F-score of pixel similarity, or squared Euclidean distance between the original and the result labels.
 Parameters
y (np.ndarray) – ground truth array
y_pred (np.ndarray) – predicted array
 Returns
score – The pixel-error. A number between 0 and 1.
 Return type
float

prc_auc_score
(y: numpy.ndarray, y_pred: numpy.ndarray) → float[source]¶ Compute area under precision-recall curve
 Parameters
y (np.ndarray) – A numpy array of shape (N, n_classes) or (N,) with true labels
y_pred (np.ndarray) – Of shape (N, n_classes) with class probabilities.
 Returns
The area under the precision-recall curve. A number between 0 and 1.
 Return type
float

kappa_score
(y1, y2, *, labels=None, weights=None, sample_weight=None)[source]¶ Cohen’s kappa: a statistic that measures inter-annotator agreement.
This function computes Cohen’s kappa [1]_, a score that expresses the level of agreement between two annotators on a classification problem. It is defined as
\[\kappa = (p_o - p_e) / (1 - p_e)\]
where \(p_o\) is the empirical probability of agreement on the label assigned to any sample (the observed agreement ratio), and \(p_e\) is the expected agreement when both annotators assign labels randomly. \(p_e\) is estimated using a per-annotator empirical prior over the class labels [2]_.
Read more in the User Guide.
 Parameters
y1 (array, shape = [n_samples]) – Labels assigned by the first annotator.
y2 (array, shape = [n_samples]) – Labels assigned by the second annotator. The kappa statistic is symmetric, so swapping y1 and y2 doesn’t change the value.
labels (array, shape = [n_classes], optional) – List of labels to index the matrix. This may be used to select a subset of labels. If None, all labels that appear at least once in y1 or y2 are used.
weights (str, optional) – Weighting type to calculate the score. None means unweighted; “linear” means linearly weighted; “quadratic” means quadratically weighted.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
 Returns
kappa – The kappa statistic, which is a number between -1 and 1. The maximum value means complete agreement; zero or lower means chance agreement.
 Return type
float
References
 1
J. Cohen (1960). “A coefficient of agreement for nominal scales”. Educational and Psychological Measurement 20(1):37-46. doi:10.1177/001316446002000104.
 2
 3
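The definition of kappa given above, with p_o as the observed agreement and p_e as the agreement expected from each annotator's label frequencies, can be sketched directly. This is an unweighted plain-Python illustration with a hypothetical helper name (cohen_kappa); it does not cover the labels, weights or sample_weight options.

```python
from collections import Counter


def cohen_kappa(y1, y2):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(y1)
    # p_o: fraction of samples on which both annotators agree.
    p_o = sum(a == b for a, b in zip(y1, y2)) / n
    # p_e: chance agreement from each annotator's empirical label frequencies.
    freq1, freq2 = Counter(y1), Counter(y2)
    p_e = sum((freq1[l] / n) * (freq2[l] / n) for l in set(y1) | set(y2))
    # Note: p_e == 1 (both annotators always use one and the same label)
    # would divide by zero; this sketch does not handle that case.
    return (p_o - p_e) / (1 - p_e)


annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]

kappa = cohen_kappa(annotator_1, annotator_2)
print(kappa)  # p_o = 0.75, p_e = 0.5, so kappa = 0.5
```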

bedroc_score
(y_true: numpy.ndarray, y_pred: numpy.ndarray, alpha: float = 20.0)[source]¶ Compute BEDROC metric.
BEDROC metric implemented according to Truchon and Bayly. It modifies the ROC score by adding a factor that rewards early recognition. See [1]_ for details.
 Parameters
y_true (np.ndarray) – Binary class labels. 1 for positive class, 0 otherwise
y_pred (np.ndarray) – Predicted labels
alpha (float, default 20.0) – Early recognition parameter
 Returns
Value in [0, 1] that indicates the degree of early recognition
 Return type
float
Notes
This function requires RDKit to be installed.
References
 1
Truchon et al. “Evaluating virtual screening methods: good and bad metrics for the ‘early recognition’ problem.” Journal of chemical information and modeling 47.2 (2007): 488-508.

concordance_index
(y_true: numpy.ndarray, y_pred: numpy.ndarray) → float[source]¶ Compute Concordance index.
A statistical metric that indicates the quality of the predicted ranking. See [1]_ for details.
 Parameters
y_true (np.ndarray) – Continuous ground truth values
y_pred (np.ndarray) – Predicted values
 Returns
A score in the range [0, 1].
 Return type
float
References
 1
Steck, Harald, et al. “On ranking in survival analysis: Bounds on the concordance index.” Advances in neural information processing systems (2008): 1209-1216.
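One common way to read the concordance index is as the fraction of comparable pairs whose predicted order agrees with their true order. The sketch below is a naive O(n^2) plain-Python version under that reading, with a hypothetical helper name (naive_concordance_index). Skipping pairs with tied true values and counting tied predictions as half are assumptions of this sketch; the library implementation may handle ties differently.

```python
def naive_concordance_index(y_true, y_pred):
    """Fraction of comparable pairs that y_pred orders the same way as y_true."""
    concordant = 0.0
    comparable = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue  # tied ground truth: pair is not comparable
            comparable += 1
            true_order = y_true[i] < y_true[j]
            if y_pred[i] == y_pred[j]:
                concordant += 0.5  # tied predictions count as half
            elif (y_pred[i] < y_pred[j]) == true_order:
                concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


ci = naive_concordance_index([1, 2, 3, 4], [1, 3, 2, 4])
print(ci)  # 5 of the 6 pairs are ordered correctly
```

A value of 1 means the predicted ranking matches the true ranking exactly, and a value of 0.5 is what a random ranking would achieve on average.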

get_motif_scores
(encoded_sequences: numpy.ndarray, motif_names: List[str], max_scores: Optional[int] = None, return_positions: bool = False, GC_fraction: float = 0.4) → numpy.ndarray[source]¶ Computes pwm log odds.
 Parameters
encoded_sequences (np.ndarray) – A numpy array of shape (N_sequences, N_letters, sequence_length, 1).
motif_names (List[str]) – List of motif file names.
max_scores (int, optional) – Get top max_scores scores.
return_positions (bool, default False) – Whether to return positions or not.
GC_fraction (float, default 0.4) – GC fraction in background sequence.
 Returns
A numpy array of scores. By default, the shape is (N_sequences, num_motifs, seq_length). If max_scores is set, the shape is (N_sequences, num_motifs*max_scores). If both max_scores and return_positions are set, the array holds the max scores and their positions, and its shape is (N_sequences, 2*num_motifs*max_scores).
 Return type
np.ndarray
Notes
This method requires simdna to be installed.

get_pssm_scores
(encoded_sequences: numpy.ndarray, pssm: numpy.ndarray) → numpy.ndarray[source]¶ Convolves pssm and its reverse complement with encoded sequences and returns the maximum score at each position of each sequence.
 Parameters
encoded_sequences (np.ndarray) – A numpy array of shape (N_sequences, N_letters, sequence_length, 1).
pssm (np.ndarray) – A numpy array of shape (4, pssm_length).
 Returns
scores – A numpy array of shape (N_sequences, sequence_length).
 Return type
np.ndarray

in_silico_mutagenesis
(model: deepchem.models.models.Model, encoded_sequences: numpy.ndarray) → numpy.ndarray[source]¶ Computes in-silico mutagenesis (ISM) scores
 Parameters
model (Model) – This can be any model that accepts inputs of the required shape and produces an output of shape (N_sequences, N_tasks).
encoded_sequences (np.ndarray) – A numpy array of shape (N_sequences, N_letters, sequence_length, 1)
 Returns
A numpy array of ISM scores. The shape is (num_task, N_sequences, N_letters, sequence_length, 1).
 Return type
np.ndarray
Metric Class¶
The dc.metrics.Metric class is a wrapper around metric functions which interoperates with DeepChem dc.models.Model.

class
Metric
(metric: Callable[[…], float], task_averager: Optional[Callable[[…], Any]] = None, name: Optional[str] = None, threshold: Optional[float] = None, mode: Optional[str] = None, n_tasks: Optional[int] = None, classification_handling_mode: Optional[str] = None, threshold_value: Optional[float] = None, compute_energy_metric: Optional[bool] = None)[source]¶ Wrapper class for computing user-defined metrics.
The Metric class provides a wrapper for standardizing the API around different classes of metrics that may be useful for DeepChem models. The implementation provides a few non-standard conveniences such as built-in support for multitask and multiclass metrics.
There are a variety of different metrics this class aims to support. Metrics for classification and regression that assume that values to compare are scalars are supported.
At present, this class doesn’t support metric computation on models which don’t present scalar outputs. For example, if you have a generative model which predicts images or molecules, you will need to write a custom evaluation and metric setup.

__init__
(metric: Callable[[…], float], task_averager: Optional[Callable[[…], Any]] = None, name: Optional[str] = None, threshold: Optional[float] = None, mode: Optional[str] = None, n_tasks: Optional[int] = None, classification_handling_mode: Optional[str] = None, threshold_value: Optional[float] = None, compute_energy_metric: Optional[bool] = None)[source]¶  Parameters
metric (function) – Function that takes args y_true, y_pred (in that order) and computes desired score. If sample weights are to be considered, metric may take in an additional keyword argument sample_weight.
task_averager (function, default None) – If not None, should be a function that averages metrics across tasks.
name (str, default None) – Name of this metric
threshold (float, default None (DEPRECATED)) – Used for binary metrics and is the threshold for the positive class.
mode (str, default None) – Should usually be “classification” or “regression.”
n_tasks (int, default None) – The number of tasks this class is expected to handle.
classification_handling_mode (str, default None) –
DeepChem models by default predict class probabilities for classification problems. This means that for a given singletask prediction, after shape normalization, the DeepChem prediction will be a numpy array of shape (N, n_classes) with class probabilities. classification_handling_mode is a string that instructs this method how to handle transforming these probabilities. It can take on the following values:
None: default value. Pass y_pred directly into self.metric.
“threshold”: Use threshold_predictions to threshold y_pred. Use threshold_value as the desired threshold.
“threshold-one-hot”: Use threshold_predictions to threshold y_pred using threshold_value, then apply to_one_hot to the output.
threshold_value (float, default None) – If set, and classification_handling_mode is “threshold” or “threshold-one-hot”, apply a thresholding operation to values with this threshold. This option is only sensible on binary classification tasks. If float, this will be applied as a binary classification value.
compute_energy_metric (bool, default None (DEPRECATED)) – Deprecated metric. Will be removed in a future version of DeepChem. Do not use.
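To make the two thresholding modes of classification_handling_mode concrete, here is a rough plain-Python sketch of what they do to a binary prediction of shape (N, 2). The helpers threshold_labels and one_hot are hypothetical stand-ins, not DeepChem's threshold_predictions and to_one_hot. The sketch assumes that column 1 holds the positive-class probability and that the comparison is "greater than or equal to".

```python
def threshold_labels(probs, threshold_value=0.5):
    """Map rows of [p_class0, p_class1] to 0/1 labels.

    Assumption: column 1 is the positive class and ties go to the positive class.
    """
    return [1 if row[1] >= threshold_value else 0 for row in probs]


def one_hot(labels, n_classes=2):
    """Encode integer labels as rows of length n_classes."""
    return [[1.0 if k == label else 0.0 for k in range(n_classes)] for label in labels]


probs = [[0.9, 0.1], [0.4, 0.6], [0.25, 0.75]]

# Analogue of "threshold": class probabilities become hard 0/1 labels.
labels = threshold_labels(probs, threshold_value=0.7)

# Analogue of "threshold-one-hot": the hard labels are re-encoded as one-hot rows.
encoded = one_hot(labels)

print(labels)
print(encoded)
```

With a threshold of 0.7, the second sample is labeled negative even though its positive-class probability is above 0.5. The first mode suits metrics that expect label vectors, the second suits metrics that expect one-hot input.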

compute_metric
(y_true: numpy.ndarray, y_pred: numpy.ndarray, w: Optional[numpy.ndarray] = None, n_tasks: Optional[int] = None, n_classes: int = 2, filter_nans: bool = False, per_task_metrics: bool = False, use_sample_weights: bool = False, **kwargs) → numpy.ndarray[source]¶ Compute a performance metric for each task.
 Parameters
y_true (np.ndarray) – An np.ndarray containing true values for each task. For a classification metric, must be of shape (N,), (N, n_tasks), or (N, n_tasks, n_classes). If of shape (N, n_tasks), values can either be class labels or probabilities of the positive class for binary classification problems. For a regression metric, must be of shape (N,), (N, n_tasks), or (N, n_tasks, 1).
y_pred (np.ndarray) – An np.ndarray containing predicted values for each task. Must be of shape (N, n_tasks, n_classes) if a classification metric, else must be of shape (N, n_tasks) if a regression metric.
w (np.ndarray, default None) – An np.ndarray containing weights for each datapoint. If specified, must be of shape (N, n_tasks).
n_tasks (int, default None) – The number of tasks this class is expected to handle.
n_classes (int, default 2) – Number of classes in data for classification tasks.
filter_nans (bool, default False (DEPRECATED)) – Remove NaN values in computed metrics
per_task_metrics (bool, default False) – If true, return computed metric for each task on multitask dataset.
use_sample_weights (bool, default False) – If set, use per-sample weights w.
kwargs (dict) – Will be passed on to self.metric
 Returns
A numpy array containing metric values for each task.
 Return type
np.ndarray
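The overall idea of computing a metric per task and optionally averaging across tasks can be sketched as follows. This is a simplified plain-Python illustration for a regression metric on (N, n_tasks) data, not the actual compute_metric logic. The helper compute_multitask_metric is hypothetical and ignores weights, shape normalization and classification handling.

```python
from statistics import mean


def mean_absolute_error(y_true, y_pred):
    """A simple singletask regression metric taking (y_true, y_pred) in that order."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def compute_multitask_metric(metric, y_true, y_pred, task_averager=None):
    """Apply `metric` to each task column; average the results if an averager is given."""
    n_tasks = len(y_true[0])
    per_task = []
    for task in range(n_tasks):
        true_col = [row[task] for row in y_true]
        pred_col = [row[task] for row in y_pred]
        per_task.append(metric(true_col, pred_col))
    return task_averager(per_task) if task_averager is not None else per_task


# Three samples, two tasks.
y_true = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
y_pred = [[1.5, 10.0], [2.0, 22.0], [2.5, 27.0]]

per_task = compute_multitask_metric(mean_absolute_error, y_true, y_pred)
averaged = compute_multitask_metric(
    mean_absolute_error, y_true, y_pred, task_averager=mean
)
print(per_task, averaged)
```

Returning the per-task list corresponds loosely to per_task_metrics=True, while passing an averager mirrors the role of task_averager in the constructor.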

compute_singletask_metric
(y_true: numpy.ndarray, y_pred: numpy.ndarray, w: Optional[numpy.ndarray] = None, n_samples: Optional[int] = None, use_sample_weights: bool = False, **kwargs) → float[source]¶ Compute a metric value.
 Parameters
y_true (np.ndarray) – True values array. This array must be of shape (N, n_classes) if classification and (N,) if regression.
y_pred (np.ndarray) – Predictions array. This array must be of shape (N, n_classes) if classification and (N,) if regression.
w (np.ndarray, default None) – Sample weight array. This array must be of shape (N,)
n_samples (int, default None (DEPRECATED)) – The number of samples in the dataset. This is N. This argument is ignored.
use_sample_weights (bool, default False) – If set, use per-sample weights w.
kwargs (dict) – Will be passed on to self.metric
 Returns
metric_value – The computed value of the metric.
 Return type
float
