# Metrics

Metrics are one of the most important parts of machine learning. Unlike traditional software, in which algorithms either work or don’t work, machine learning models work in degrees. That is, there’s a continuous range of “goodness” for a model. “Metrics” are functions which measure how well a model works. There are many different choices of metrics depending on the type of model at hand.

## Metric Utilities

Metric utility functions allow for some common manipulations such as switching to/from one-hot representations.

to_one_hot(y: numpy.ndarray, n_classes: int = 2) -> numpy.ndarray

Transforms label vector into one-hot encoding.

Turns y into an array of shape (N, n_classes) with a one-hot encoding. Assumes that y takes values from 0 to n_classes - 1.

Parameters
• y (np.ndarray) – A vector of shape (N,) or (N, 1)

• n_classes (int, default 2) – If specified, use this as the number of classes. Otherwise, the function will try to infer it as n_classes = max(y) + 1 for arrays and as n_classes = 2 for scalars. Note this parameter only has an effect if mode == "classification".

Returns

A numpy array of shape (N, n_classes).

Return type

np.ndarray

from_one_hot(y: numpy.ndarray, axis: int = 1) -> numpy.ndarray

Transforms label vector from one-hot encoding.

Parameters
• y (np.ndarray) – A vector of shape (n_samples, num_classes)

• axis (int, optional (default 1)) – The axis with one-hot encodings to reduce on.

Returns

A numpy array of shape (n_samples,)

Return type

np.ndarray
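As a rough illustration of the round trip these two helpers describe, the NumPy sketch below builds a one-hot matrix from integer labels and recovers the labels with an argmax. The names `one_hot_sketch` and `from_one_hot_sketch` are hypothetical stand-ins written for this example; they are not DeepChem's implementation.

```python
import numpy as np


def one_hot_sketch(y: np.ndarray, n_classes: int = 2) -> np.ndarray:
    """Hypothetical stand-in: encode integer labels as rows of an identity matrix."""
    y = np.asarray(y).reshape(-1).astype(int)  # accept shape (N,) or (N, 1)
    return np.eye(n_classes)[y]                # shape (N, n_classes)


def from_one_hot_sketch(y: np.ndarray, axis: int = 1) -> np.ndarray:
    """Hypothetical stand-in: recover labels as the index of the largest entry."""
    return np.argmax(y, axis=axis)


labels = np.array([0, 2, 1, 2])
encoded = one_hot_sketch(labels, n_classes=3)  # one row per sample
decoded = from_one_hot_sketch(encoded)         # back to shape (N,)
```

The sketch assumes, as the documentation above does, that labels run from 0 to n_classes - 1; a label outside that range would raise an IndexError here.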

## Metric Shape Handling

One of the trickiest parts of handling metrics correctly is making sure the shapes of input weights, predictions and labels are processed correctly. This is particularly challenging because DeepChem supports multitask, multiclass models, which means that shapes must be handled with care to prevent errors. DeepChem maintains the following utility functions, which attempt to facilitate shape handling for you.

normalize_weight_shape(w: Optional[numpy.ndarray], n_samples: int, n_tasks: int) -> numpy.ndarray

A utility function to correct the shape of the weight array.

This utility function is used to normalize the shapes of a given weight array.

Parameters
• w (np.ndarray) – w can be None, a scalar, or an np.ndarray of shape (n_samples,) or of shape (n_samples, n_tasks). If w is a scalar, it’s assumed to be the same weight for all samples/tasks.

• n_samples (int) – The number of samples in the dataset. If w is an ndarray, we should have w.shape[0] == n_samples.

• n_tasks (int) – The number of tasks. If w is a 2D ndarray, then we should have w.shape[1] == n_tasks.

Examples

>>> import numpy as np
>>> w_out = normalize_weight_shape(None, n_samples=10, n_tasks=1)
>>> (w_out == np.ones((10, 1))).all()
True

Returns

w_out – Array of shape (n_samples, n_tasks)

Return type

np.ndarray
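The cases listed for w can be sketched in a few lines of NumPy. This is a minimal sketch under the shape rules stated above, not DeepChem's source; the function name `normalize_weights_sketch` and the error messages are assumptions made for the example.

```python
from typing import Optional, Union

import numpy as np


def normalize_weights_sketch(w: Optional[Union[float, np.ndarray]],
                             n_samples: int, n_tasks: int) -> np.ndarray:
    """Hypothetical sketch: return weights with shape (n_samples, n_tasks)."""
    if w is None:
        return np.ones((n_samples, n_tasks))            # default: every weight is 1
    w = np.asarray(w, dtype=float)
    if w.ndim == 0:
        return np.full((n_samples, n_tasks), float(w))  # one weight for everything
    if w.ndim == 1:
        if w.shape[0] != n_samples:
            raise ValueError("w must have one entry per sample")
        return np.repeat(w[:, None], n_tasks, axis=1)   # copy each weight across tasks
    if w.shape != (n_samples, n_tasks):
        raise ValueError("w must have shape (n_samples, n_tasks)")
    return w


default_w = normalize_weights_sketch(None, n_samples=10, n_tasks=1)
per_sample = normalize_weights_sketch(np.array([1.0, 0.5, 0.0]),
                                      n_samples=3, n_tasks=2)
```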

normalize_labels_shape(y: numpy.ndarray, mode: Optional[str] = None, n_tasks: Optional[int] = None, n_classes: Optional[int] = None) -> numpy.ndarray

A utility function to correct the shape of the labels.

Parameters
• y (np.ndarray) – y is an array of shape (N,) or (N, n_tasks) or (N, n_tasks, 1).

• mode (str, default None) – If mode is “classification” or “regression”, attempts to apply data transformations.

• n_tasks (int, default None) – The number of tasks this class is expected to handle.

• n_classes (int, default None) – If specified, use this as the number of classes. Otherwise, the function will try to infer it as n_classes = max(y) + 1 for arrays and as n_classes = 2 for scalars. Note this parameter only has an effect if mode == "classification".

Returns

y_out – If mode == "classification", y_out is an array of shape (N, n_tasks, n_classes). If mode == "regression", y_out is an array of shape (N, n_tasks).

Return type

np.ndarray

normalize_prediction_shape(y: numpy.ndarray, mode: Optional[str] = None, n_tasks: Optional[int] = None, n_classes: Optional[int] = None)

A utility function to correct the shape of provided predictions.

The metric computation classes expect that inputs for classification have the uniform shape (N, n_tasks, n_classes) and inputs for regression have the uniform shape (N, n_tasks). This function normalizes the provided input array to have the desired shape.

Examples

>>> import numpy as np
>>> y = np.random.rand(10)
>>> y_out = normalize_prediction_shape(y, "regression", n_tasks=1)
>>> y_out.shape
(10, 1)

Parameters
• y (np.ndarray) – If mode == "classification", y is an array of shape (N,) or (N, n_tasks) or (N, n_tasks, n_classes). If mode == "regression", y is an array of shape (N,) or (N, n_tasks) or (N, n_tasks, 1).

• mode (str, default None) – If mode is “classification” or “regression”, attempts to apply data transformations.

• n_tasks (int, default None) – The number of tasks this class is expected to handle.

• n_classes (int, default None) – If specified, use this as the number of classes. Otherwise, the function will try to infer it as n_classes = max(y) + 1 for arrays and as n_classes = 2 for scalars. Note this parameter only has an effect if mode == "classification".

Returns

y_out – If mode == "classification", y_out is an array of shape (N, n_tasks, n_classes). If mode == "regression", y_out is an array of shape (N, n_tasks).

Return type

np.ndarray
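For the regression case, the accepted input shapes and the uniform output shape (N, n_tasks) can be illustrated with a short NumPy sketch. The function `regression_shape_sketch` is a hypothetical helper written for this example, and its choice to reject 1D input when n_tasks is not 1 is an assumption rather than documented library behavior.

```python
import numpy as np


def regression_shape_sketch(y: np.ndarray, n_tasks: int = 1) -> np.ndarray:
    """Hypothetical sketch: coerce regression outputs to shape (N, n_tasks)."""
    y = np.asarray(y)
    if y.ndim == 1:
        # Assumption: a flat (N,) array is only unambiguous for a single task.
        if n_tasks != 1:
            raise ValueError("1D input requires n_tasks == 1")
        return y[:, None]
    if y.ndim == 3 and y.shape[-1] == 1:
        y = y[:, :, 0]  # drop the trailing singleton axis
    if y.ndim != 2 or y.shape[1] != n_tasks:
        raise ValueError("expected an array that reduces to (N, n_tasks)")
    return y


flat = regression_shape_sketch(np.zeros(10))                        # (10,) input
stacked = regression_shape_sketch(np.zeros((10, 3, 1)), n_tasks=3)  # (10, 3, 1) input
```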

handle_classification_mode(y: numpy.ndarray, classification_handling_mode: Optional[str], threshold_value: Optional[float] = None) -> numpy.ndarray

Handle classification mode.

Transform predictions so that they have the correct classification mode.

Parameters
• y (np.ndarray) – Must be of shape (N, n_tasks, n_classes)

• classification_handling_mode (str, default None) – DeepChem models by default predict class probabilities for classification problems. This means that for a given singletask prediction, after shape normalization, the DeepChem prediction will be a numpy array of shape (N, n_classes) with class probabilities. classification_handling_mode is a string that instructs this method how to handle transforming these probabilities. It can take on the following values:

  - None: default value. Pass y_pred directly into self.metric.
  - "threshold": Use threshold_predictions to threshold y_pred. Use threshold_value as the desired threshold.
  - "threshold-one-hot": Use threshold_predictions to threshold y_pred using threshold_value, then apply to_one_hot to the output.

• threshold_value (float, default None) – If set, and classification_handling_mode is "threshold" or "threshold-one-hot", apply a thresholding operation to values with this threshold. This option is only sensible on binary classification tasks. If float, this will be applied as a binary classification value.

Returns

y_out – If classification_handling_mode is None, then of shape (N, n_tasks, n_classes). If classification_handling_mode is "threshold", then of shape (N, n_tasks). If classification_handling_mode is "threshold-one-hot", then of shape (N, n_tasks, n_classes).

Return type

np.ndarray
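The shape changes described for the "threshold" and "threshold-one-hot" modes can be illustrated on a small binary example. The sketch below is not the library's code: `threshold_sketch` is a hypothetical helper, and it assumes that class index 1 is the positive class and that a probability equal to the threshold counts as positive.

```python
import numpy as np


def threshold_sketch(y_prob: np.ndarray, threshold: float = 0.5,
                     one_hot: bool = False) -> np.ndarray:
    """Hypothetical sketch for binary tasks; y_prob has shape (N, n_tasks, 2)."""
    # Assumption: index 1 holds the positive-class probability, ties are positive.
    hard = (y_prob[:, :, 1] >= threshold).astype(int)  # shape (N, n_tasks)
    if one_hot:
        return np.eye(2)[hard]                         # shape (N, n_tasks, 2)
    return hard


probs = np.array([[[0.9, 0.1]], [[0.3, 0.7]], [[0.5, 0.5]]])  # N=3, n_tasks=1
hard_labels = threshold_sketch(probs)
hard_one_hot = threshold_sketch(probs, one_hot=True)
```

Thresholding drops the class axis, and the one-hot step restores it, which is why the two modes return arrays of different rank.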

## Metric Functions

DeepChem has a variety of different metrics which are useful for measuring model performance. A number (but not all) of these metrics are directly sourced from sklearn.

matthews_corrcoef(y_true, y_pred, *, sample_weight=None)

Compute the Matthews correlation coefficient (MCC).

The Matthews correlation coefficient is used in machine learning as a measure of the quality of binary and multiclass classifications. It takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes. The MCC is in essence a correlation coefficient value between -1 and +1. A coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction. The statistic is also known as the phi coefficient. [source: Wikipedia]

Binary and multiclass labels are supported. Only in the binary case does this relate to information about true and false positives and negatives. See references below.

Parameters
• y_true (array, shape = [n_samples]) – Ground truth (correct) target values.

• y_pred (array, shape = [n_samples]) – Estimated targets as returned by a classifier.

• sample_weight (array-like of shape (n_samples,), default=None) –

Sample weights.

New in version 0.18.

Returns

mcc – The Matthews correlation coefficient (+1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction).

Return type

float

References

1
2
3
4

Examples

>>> from sklearn.metrics import matthews_corrcoef
>>> y_true = [+1, +1, +1, -1]
>>> y_pred = [+1, -1, +1, +1]
>>> matthews_corrcoef(y_true, y_pred)
-0.33...
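For the binary case, the coefficient can be written in terms of confusion-matrix counts as MCC = (TP * TN - FP * FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)). The pure-Python sketch below recomputes the example above by hand; `mcc_from_counts` is a hypothetical helper, and returning 0.0 for a zero denominator is a convention assumed here.

```python
import math


def mcc_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Binary Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


y_true = [+1, +1, +1, -1]
y_pred = [+1, -1, +1, +1]
pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)
tn = sum(t == -1 and p == -1 for t, p in pairs)
fp = sum(t == -1 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == -1 for t, p in pairs)
mcc = mcc_from_counts(tp, tn, fp, fn)  # -1/3, matching the -0.33... above
```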

recall_score(y_true, y_pred, *, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')

Compute the recall.

The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.

The best value is 1 and the worst value is 0.

Parameters
• y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) target values.

• y_pred (1d array-like, or label indicator array / sparse matrix) – Estimated targets as returned by a classifier.

• labels (array-like, default=None) –

The set of labels to include when average != 'binary', and their order if average is None. Labels present in the data can be excluded, for example to calculate a multiclass average ignoring a majority negative class, while labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in y_true and y_pred are used in sorted order.

Changed in version 0.17: Parameter labels improved for multiclass problem.

• pos_label (str or int, default=1) – The class to report if average='binary' and the data is binary. If the data are multiclass or multilabel, this will be ignored; setting labels=[pos_label] and average != 'binary' will report scores for that label only.

• average ({'micro', 'macro', 'samples', 'weighted', 'binary'} or None, default='binary') –

This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:

'binary':

Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.

'micro':

Calculate metrics globally by counting the total true positives, false negatives and false positives.

'macro':

Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

'weighted':

Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall. Weighted recall is equal to accuracy.

'samples':

Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from accuracy_score()).

• sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

• zero_division ("warn", 0 or 1, default="warn") – Sets the value to return when there is a zero division. If set to “warn”, this acts as 0, but warnings are also raised.

Returns

recall – Recall of the positive class in binary classification or weighted average of the recall of each class for the multiclass task.

Return type

float (if average is not None) or array of float of shape (n_unique_labels,)

See also

precision_recall_fscore_support

Compute precision, recall, F-measure and support for each class.

precision_score

Compute the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives.

balanced_accuracy_score

Compute balanced accuracy to deal with imbalanced datasets.

multilabel_confusion_matrix

Compute a confusion matrix for each class or sample.

PrecisionRecallDisplay.from_estimator

Plot precision-recall curve given an estimator and some data.

PrecisionRecallDisplay.from_predictions

Plot precision-recall curve given binary class predictions.

Notes

When true positive + false negative == 0, recall returns 0 and raises UndefinedMetricWarning. This behavior can be modified with zero_division.

Examples

>>> from sklearn.metrics import recall_score
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> recall_score(y_true, y_pred, average='macro')
0.33...
>>> recall_score(y_true, y_pred, average='micro')
0.33...
>>> recall_score(y_true, y_pred, average='weighted')
0.33...
>>> recall_score(y_true, y_pred, average=None)
array([1., 0., 0.])
>>> y_true = [0, 0, 0, 0, 0, 0]
>>> recall_score(y_true, y_pred, average=None)
array([0.5, 0. , 0. ])
>>> recall_score(y_true, y_pred, average=None, zero_division=1)
array([0.5, 1. , 1. ])
>>> # multilabel classification
>>> y_true = [[0, 0, 0], [1, 1, 1], [0, 1, 1]]
>>> y_pred = [[0, 0, 0], [1, 1, 1], [1, 1, 0]]
>>> recall_score(y_true, y_pred, average=None)
array([1. , 1. , 0.5])
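To make the tp / (tp + fn) definition concrete, the following pure-Python sketch recomputes the per-class and macro-averaged recall for the first multiclass example above. The helper `recall_per_class` is hypothetical, and treating an undefined recall as 0.0 mirrors the default behavior described in the Notes.

```python
def recall_per_class(y_true, y_pred, labels):
    """Return tp / (tp + fn) per label; 0.0 is assumed when a label never occurs."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        scores.append(tp / (tp + fn) if tp + fn else 0.0)
    return scores


y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
per_class = recall_per_class(y_true, y_pred, labels=[0, 1, 2])
macro_recall = sum(per_class) / len(per_class)  # unweighted mean over classes
```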

r2_score(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average')

$R^2$ (coefficient of determination) regression score function.

Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an $R^2$ score of 0.0.

Parameters
• y_true (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Ground truth (correct) target values.

• y_pred (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Estimated target values.

• sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

• multioutput ({'raw_values', 'uniform_average', 'variance_weighted'}, array-like of shape (n_outputs,) or None, default='uniform_average') –

Defines aggregating of multiple output scores. Array-like value defines weights used to average scores. Default is “uniform_average”.

'raw_values':

Returns a full set of scores in case of multioutput input.

'uniform_average':

Scores of all outputs are averaged with uniform weight.

'variance_weighted':

Scores of all outputs are averaged, weighted by the variances of each individual output.

Changed in version 0.19: Default value of multioutput is ‘uniform_average’.

Returns

z – The $R^2$ score or ndarray of scores if ‘multioutput’ is ‘raw_values’.

Return type

float or ndarray of floats

Notes

This is not a symmetric function.

Unlike most other scores, the $R^2$ score may be negative (it need not actually be the square of a quantity R).

This metric is not well-defined for single samples and will return a NaN value if n_samples is less than two.

References

1

Wikipedia entry on the Coefficient of determination

Examples

>>> from sklearn.metrics import r2_score
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> r2_score(y_true, y_pred)
0.948...
>>> y_true = [[0.5, 1], [-1, 1], [7, -6]]
>>> y_pred = [[0, 2], [-1, 2], [8, -5]]
>>> r2_score(y_true, y_pred,
...          multioutput='variance_weighted')
0.938...
>>> y_true = [1, 2, 3]
>>> y_pred = [1, 2, 3]
>>> r2_score(y_true, y_pred)
1.0
>>> y_true = [1, 2, 3]
>>> y_pred = [2, 2, 2]
>>> r2_score(y_true, y_pred)
0.0
>>> y_true = [1, 2, 3]
>>> y_pred = [3, 2, 1]
>>> r2_score(y_true, y_pred)
-3.0
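The last three examples follow directly from the usual definition R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of y_true from its mean. A minimal single-output sketch, with the hypothetical helper `r2_sketch` assuming SS_tot is nonzero and no sample weights:

```python
def r2_sketch(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot for a single output (assumes SS_tot > 0)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


perfect = r2_sketch([1, 2, 3], [1, 2, 3])       # no residual error
constant = r2_sketch([1, 2, 3], [2, 2, 2])      # always predicts the mean
reversed_fit = r2_sketch([1, 2, 3], [3, 2, 1])  # worse than predicting the mean
```

Predicting the mean gives SS_res equal to SS_tot, hence a score of 0, while the reversed predictions have four times that residual error and score -3.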

mean_squared_error(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average', squared=True)

Mean squared error regression loss.

Parameters
• y_true (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Ground truth (correct) target values.

• y_pred (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Estimated target values.

• sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

• multioutput ({'raw_values', 'uniform_average'} or array-like of shape (n_outputs,), default='uniform_average') –

Defines aggregating of multiple output values. Array-like value defines weights used to average errors.

'raw_values':

Returns a full set of errors in case of multioutput input.

'uniform_average':

Errors of all outputs are averaged with uniform weight.

• squared (bool, default=True) – If True returns MSE value, if False returns RMSE value.

Returns

loss – A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.

Return type

float or ndarray of floats

Examples

>>> from sklearn.metrics import mean_squared_error
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> mean_squared_error(y_true, y_pred)
0.375
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> mean_squared_error(y_true, y_pred, squared=False)
0.612...
>>> y_true = [[0.5, 1],[-1, 1],[7, -6]]
>>> y_pred = [[0, 2],[-1, 2],[8, -5]]
>>> mean_squared_error(y_true, y_pred)
0.708...
>>> mean_squared_error(y_true, y_pred, squared=False)
0.822...
>>> mean_squared_error(y_true, y_pred, multioutput='raw_values')
array([0.41666667, 1.        ])
>>> mean_squared_error(y_true, y_pred, multioutput=[0.3, 0.7])
0.825...
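The relationship between the two values controlled by squared can be checked by hand: RMSE is simply the square root of MSE. The pure-Python sketch below reproduces the first single-output example; `mse_sketch` is a hypothetical helper that ignores sample weights.

```python
import math


def mse_sketch(y_true, y_pred):
    """Mean of squared residuals for a single output."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
mse = mse_sketch(y_true, y_pred)  # (0.25 + 0.25 + 0 + 1) / 4
rmse = math.sqrt(mse)             # the value that squared=False corresponds to
```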

mean_absolute_error(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average')

Mean absolute error regression loss.

Parameters
• y_true (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Ground truth (correct) target values.

• y_pred (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Estimated target values.

• sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

• multioutput ({'raw_values', 'uniform_average'} or array-like of shape (n_outputs,), default='uniform_average') –

Defines aggregating of multiple output values. Array-like value defines weights used to average errors.

'raw_values':

Returns a full set of errors in case of multioutput input.

'uniform_average':

Errors of all outputs are averaged with uniform weight.

Returns

loss – If multioutput is ‘raw_values’, then mean absolute error is returned for each output separately. If multioutput is ‘uniform_average’ or an ndarray of weights, then the weighted average of all output errors is returned.

MAE output is non-negative floating point. The best value is 0.0.

Return type

float or ndarray of floats

Examples

>>> from sklearn.metrics import mean_absolute_error
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> mean_absolute_error(y_true, y_pred)
0.5
>>> y_true = [[0.5, 1], [-1, 1], [7, -6]]
>>> y_pred = [[0, 2], [-1, 2], [8, -5]]
>>> mean_absolute_error(y_true, y_pred)
0.75
>>> mean_absolute_error(y_true, y_pred, multioutput='raw_values')
array([0.5, 1. ])
>>> mean_absolute_error(y_true, y_pred, multioutput=[0.3, 0.7])
0.85...
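The three multioutput behaviors in the example above differ only in how the per-output errors are combined. The sketch below computes the per-output errors once and then averages them uniformly or with explicit weights; `mae_per_output` is a hypothetical helper written for this illustration.

```python
def mae_per_output(y_true, y_pred):
    """Mean absolute error of each output column (the 'raw_values' view)."""
    n_samples = len(y_true)
    n_outputs = len(y_true[0])
    return [
        sum(abs(y_true[i][j] - y_pred[i][j]) for i in range(n_samples)) / n_samples
        for j in range(n_outputs)
    ]


y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]
raw = mae_per_output(y_true, y_pred)                    # one error per output
uniform = sum(raw) / len(raw)                           # 'uniform_average'
weighted = sum(w * e for w, e in zip([0.3, 0.7], raw))  # array-like weights
```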

precision_score(y_true, y_pred, *, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')

Compute the precision.

The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

The best value is 1 and the worst value is 0.

Parameters
• y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) target values.

• y_pred (1d array-like, or label indicator array / sparse matrix) – Estimated targets as returned by a classifier.

• labels (array-like, default=None) –

The set of labels to include when average != 'binary', and their order if average is None. Labels present in the data can be excluded, for example to calculate a multiclass average ignoring a majority negative class, while labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in y_true and y_pred are used in sorted order.

Changed in version 0.17: Parameter labels improved for multiclass problem.

• pos_label (str or int, default=1) – The class to report if average='binary' and the data is binary. If the data are multiclass or multilabel, this will be ignored; setting labels=[pos_label] and average != 'binary' will report scores for that label only.

• average ({'micro', 'macro', 'samples', 'weighted', 'binary'} or None, default='binary') –

This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:

'binary':

Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.

'micro':

Calculate metrics globally by counting the total true positives, false negatives and false positives.

'macro':

Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

'weighted':

Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.

'samples':

Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from accuracy_score()).

• sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

• zero_division ("warn", 0 or 1, default="warn") – Sets the value to return when there is a zero division. If set to “warn”, this acts as 0, but warnings are also raised.

Returns

precision – Precision of the positive class in binary classification or weighted average of the precision of each class for the multiclass task.

Return type

float (if average is not None) or array of float of shape (n_unique_labels,)

See also

precision_recall_fscore_support

Compute precision, recall, F-measure and support for each class.

recall_score

Compute the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives.

PrecisionRecallDisplay.from_estimator

Plot precision-recall curve given an estimator and some data.

PrecisionRecallDisplay.from_predictions

Plot precision-recall curve given binary class predictions.

multilabel_confusion_matrix

Compute a confusion matrix for each class or sample.

Notes

When true positive + false positive == 0, precision returns 0 and raises UndefinedMetricWarning. This behavior can be modified with zero_division.

Examples

>>> from sklearn.metrics import precision_score
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> precision_score(y_true, y_pred, average='macro')
0.22...
>>> precision_score(y_true, y_pred, average='micro')
0.33...
>>> precision_score(y_true, y_pred, average='weighted')
0.22...
>>> precision_score(y_true, y_pred, average=None)
array([0.66..., 0.        , 0.        ])
>>> y_pred = [0, 0, 0, 0, 0, 0]
>>> precision_score(y_true, y_pred, average=None)
array([0.33..., 0.        , 0.        ])
>>> precision_score(y_true, y_pred, average=None, zero_division=1)
array([0.33..., 1.        , 1.        ])
>>> # multilabel classification
>>> y_true = [[0, 0, 0], [1, 1, 1], [0, 1, 1]]
>>> y_pred = [[0, 0, 0], [1, 1, 1], [1, 1, 0]]
>>> precision_score(y_true, y_pred, average=None)
array([0.5, 1. , 1. ])
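The tp / (tp + fp) definition can be checked against the first multiclass example with a few lines of plain Python. The helper `precision_per_class` is hypothetical, and it assumes a precision of 0.0 for any label that is never predicted, in line with the default described in the Notes.

```python
def precision_per_class(y_true, y_pred, labels):
    """Return tp / (tp + fp) per label; 0.0 is assumed when a label is never predicted."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        scores.append(tp / (tp + fp) if tp + fp else 0.0)
    return scores


y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
per_class = precision_per_class(y_true, y_pred, labels=[0, 1, 2])
macro_precision = sum(per_class) / len(per_class)  # unweighted mean over classes
```

Class 0 is predicted three times and is correct twice, giving 2/3; the other two classes are never predicted correctly, so the macro average is 2/9.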

precision_recall_curve(y_true, probas_pred, *, pos_label=None, sample_weight=None)

Compute precision-recall pairs for different probability thresholds.

Note: this implementation is restricted to the binary classification task.

The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.

The last precision and recall values are 1. and 0. respectively and do not have a corresponding threshold. This ensures that the graph starts on the y axis.

Parameters
• y_true (ndarray of shape (n_samples,)) – True binary labels. If labels are not either {-1, 1} or {0, 1}, then pos_label should be explicitly given.

• probas_pred (ndarray of shape (n_samples,)) – Target scores, can either be probability estimates of the positive class, or non-thresholded measure of decisions (as returned by decision_function on some classifiers).

• pos_label (int or str, default=None) – The label of the positive class. When pos_label=None, if y_true is in {-1, 1} or {0, 1}, pos_label is set to 1, otherwise an error will be raised.

• sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns

• precision (ndarray of shape (n_thresholds + 1,)) – Precision values such that element i is the precision of predictions with score >= thresholds[i] and the last element is 1.

• recall (ndarray of shape (n_thresholds + 1,)) – Decreasing recall values such that element i is the recall of predictions with score >= thresholds[i] and the last element is 0.

• thresholds (ndarray of shape (n_thresholds,)) – Increasing thresholds on the decision function used to compute precision and recall. n_thresholds <= len(np.unique(probas_pred)).

See also

PrecisionRecallDisplay.from_estimator

Plot Precision Recall Curve given a binary classifier.

PrecisionRecallDisplay.from_predictions

Plot Precision Recall Curve using predictions from a binary classifier.

average_precision_score

Compute average precision from prediction scores.

det_curve

Compute error rates for different probability thresholds.

roc_curve

Compute Receiver operating characteristic (ROC) curve.

Examples

>>> import numpy as np
>>> from sklearn.metrics import precision_recall_curve
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> precision, recall, thresholds = precision_recall_curve(
...     y_true, y_scores)
>>> precision
array([0.66666667, 0.5       , 1.        , 1.        ])
>>> recall
array([1. , 0.5, 0.5, 0. ])
>>> thresholds
array([0.35, 0.4 , 0.8 ])
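Each precision-recall pair in the output above comes from treating every score at or above a threshold as a positive prediction. The sketch below recomputes the pairs for the three listed thresholds; `precision_recall_at` is a hypothetical helper that assumes labels in {0, 1} and at least one positive prediction per threshold, and it omits the final (1, 0) point that the function appends.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and t == 1 for p, t in zip(predicted, y_true))
    fp = sum(p and t == 0 for p, t in zip(predicted, y_true))
    fn = sum((not p) and t == 1 for p, t in zip(predicted, y_true))
    return tp / (tp + fp), tp / (tp + fn)


y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]
# One (precision, recall) pair per threshold, in increasing threshold order.
curve = [precision_recall_at(y_true, y_scores, th) for th in (0.35, 0.4, 0.8)]
```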

auc(x, y)

Compute Area Under the Curve (AUC) using the trapezoidal rule.

This is a general function, given points on a curve. For computing the area under the ROC-curve, see roc_auc_score(). For an alternative way to summarize a precision-recall curve, see average_precision_score().

Parameters
• x (ndarray of shape (n,)) – x coordinates. These must be either monotonic increasing or monotonic decreasing.

• y (ndarray of shape (n,)) – y coordinates.

Returns

auc

Return type

float

See also

roc_auc_score

Compute the area under the ROC curve.

average_precision_score

Compute average precision from prediction scores.

precision_recall_curve

Compute precision-recall pairs for different probability thresholds.

Examples

>>> import numpy as np
>>> from sklearn import metrics
>>> y = np.array([1, 1, 2, 2])
>>> pred = np.array([0.1, 0.4, 0.35, 0.8])
>>> fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)
>>> metrics.auc(fpr, tpr)
0.75
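The trapezoidal rule sums, for each pair of neighboring points, the segment width times the average of the two heights. The sketch below applies it to ROC points worked out by hand for the example above; `trapezoid_area` is a hypothetical helper, and the point list is an assumption of this illustration (a library routine may return fewer points for the same curve, with the same area).

```python
def trapezoid_area(x, y):
    """Area under the piecewise-linear curve through the points (x[i], y[i])."""
    return sum(
        (x[i + 1] - x[i]) * (y[i] + y[i + 1]) / 2.0
        for i in range(len(x) - 1)
    )


# ROC points worked out by hand for y = [1, 1, 2, 2], pred = [0.1, 0.4, 0.35, 0.8],
# pos_label = 2, sweeping the decision threshold from high to low.
fpr = [0.0, 0.0, 0.5, 0.5, 1.0]
tpr = [0.0, 0.5, 0.5, 1.0, 1.0]
area = trapezoid_area(fpr, tpr)
```

Vertical segments contribute nothing because their width is zero, so only the two horizontal steps (0.25 and 0.5) add up to the reported area.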

jaccard_score(y_true, y_pred, *, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')

Jaccard similarity coefficient score.

The Jaccard index [1], or Jaccard similarity coefficient, defined as the size of the intersection divided by the size of the union of two label sets, is used to compare set of predicted labels for a sample to the corresponding set of labels in y_true.

Parameters
• y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) labels.

• y_pred (1d array-like, or label indicator array / sparse matrix) – Predicted labels, as returned by a classifier.

• labels (array-like of shape (n_classes,), default=None) – The set of labels to include when average != 'binary', and their order if average is None. Labels present in the data can be excluded, for example to calculate a multiclass average ignoring a majority negative class, while labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in y_true and y_pred are used in sorted order.

• pos_label (str or int, default=1) – The class to report if average='binary' and the data is binary. If the data are multiclass or multilabel, this will be ignored; setting labels=[pos_label] and average != 'binary' will report scores for that label only.

• average ({'micro', 'macro', 'samples', 'weighted', 'binary'} or None, default='binary') –

If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:

'binary':

Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.

'micro':

Calculate metrics globally by counting the total true positives, false negatives and false positives.

'macro':

Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

'weighted':

Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance.

'samples':

Calculate metrics for each instance, and find their average (only meaningful for multilabel classification).

• sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

• zero_division ("warn", {0.0, 1.0}, default="warn") – Sets the value to return when there is a zero division, i.e. when there are no negative values in predictions and labels. If set to “warn”, this acts like 0, but a warning is also raised.

Returns

score

Return type

float (if average is not None) or array of floats, shape = [n_unique_labels]

See also

accuracy_score, f1_score, multilabel_confusion_matrix

Notes

jaccard_score() may be a poor metric if there are no positives for some samples or classes. Jaccard is undefined if there are no true or predicted labels, and our implementation will return a score of 0 with a warning.

References

1

Examples

>>> import numpy as np
>>> from sklearn.metrics import jaccard_score
>>> y_true = np.array([[0, 1, 1],
...                    [1, 1, 0]])
>>> y_pred = np.array([[1, 1, 1],
...                    [1, 0, 0]])


In the binary case:

>>> jaccard_score(y_true[0], y_pred[0])
0.6666...


In the multilabel case:

>>> jaccard_score(y_true, y_pred, average='samples')
0.5833...
>>> jaccard_score(y_true, y_pred, average='macro')
0.6666...
>>> jaccard_score(y_true, y_pred, average=None)
array([0.5, 0.5, 1. ])


In the multiclass case:

>>> y_pred = [0, 2, 1, 2]
>>> y_true = [0, 1, 2, 2]
>>> jaccard_score(y_true, y_pred, average=None)
array([1. , 0. , 0.33...])
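The intersection-over-union definition is easy to verify on the multilabel example above by treating each row as the set of positions labeled 1. The helper `jaccard_binary` below is a hypothetical sketch; returning 0.0 when both sets are empty is an assumption that mirrors the behavior described in the Notes.

```python
def jaccard_binary(y_true, y_pred):
    """|intersection| / |union| of the positions labeled 1 in each vector."""
    true_pos = {i for i, v in enumerate(y_true) if v == 1}
    pred_pos = {i for i, v in enumerate(y_pred) if v == 1}
    union = true_pos | pred_pos
    # Assumption: return 0.0 when neither vector has any positives.
    return len(true_pos & pred_pos) / len(union) if union else 0.0


y_true = [[0, 1, 1], [1, 1, 0]]
y_pred = [[1, 1, 1], [1, 0, 0]]
row_scores = [jaccard_binary(t, p) for t, p in zip(y_true, y_pred)]
samples_average = sum(row_scores) / len(row_scores)  # the 'samples' average
```

The first row scores 2/3 and the second 1/2, so their mean reproduces the 0.5833... reported for average='samples'.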

f1_score(y_true, y_pred, *, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')

Compute the F1 score, also known as balanced F-score or F-measure.

The F1 score can be interpreted as a harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. The formula for the F1 score is:

F1 = 2 * (precision * recall) / (precision + recall)


In the multi-class and multi-label case, this is the average of the F1 score of each class with weighting depending on the average parameter.
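A short sketch shows how the harmonic mean in the formula above behaves. The helper `f1_from` is hypothetical, and returning 0.0 when precision and recall are both zero is a convention assumed for this example.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 is assumed when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)


balanced = f1_from(0.5, 0.5)  # equal inputs: F1 equals both of them
skewed = f1_from(1.0, 0.25)   # unequal inputs: F1 is pulled toward the lower one
```

For the skewed case the arithmetic mean would be 0.625, while the F1 score is 0.4, which is why a high F1 requires both precision and recall to be high.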

Parameters
• y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) target values.

• y_pred (1d array-like, or label indicator array / sparse matrix) – Estimated targets as returned by a classifier.

• labels (array-like, default=None) –

The set of labels to include when average != 'binary', and their order if average is None. Labels present in the data can be excluded, for example to calculate a multiclass average ignoring a majority negative class, while labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in y_true and y_pred are used in sorted order.

Changed in version 0.17: Parameter labels improved for multiclass problem.

• pos_label (str or int, default=1) – The class to report if average='binary' and the data is binary. If the data are multiclass or multilabel, this will be ignored; setting labels=[pos_label] and average != 'binary' will report scores for that label only.

• average ({'micro', 'macro', 'samples', 'weighted', 'binary'} or None, default='binary') –

This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:

'binary':

Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.

'micro':

Calculate metrics globally by counting the total true positives, false negatives and false positives.

'macro':

Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

'weighted':

Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.

'samples':

Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from accuracy_score()).

• sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

• zero_division ("warn", 0 or 1, default="warn") – Sets the value to return when there is a zero division, i.e. when all predictions and labels are negative. If set to “warn”, this acts as 0, but warnings are also raised.

Returns

f1_score – F1 score of the positive class in binary classification or weighted average of the F1 scores of each class for the multiclass task.

Return type

float or array of float, shape = [n_unique_labels]

See also

fbeta_score, precision_recall_fscore_support, jaccard_score, multilabel_confusion_matrix

Examples

>>> from sklearn.metrics import f1_score
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> f1_score(y_true, y_pred, average='macro')
0.26...
>>> f1_score(y_true, y_pred, average='micro')
0.33...
>>> f1_score(y_true, y_pred, average='weighted')
0.26...
>>> f1_score(y_true, y_pred, average=None)
array([0.8, 0. , 0. ])
>>> y_true = [0, 0, 0, 0, 0, 0]
>>> y_pred = [0, 0, 0, 0, 0, 0]
>>> f1_score(y_true, y_pred, zero_division=1)
1.0...
>>> # multilabel classification
>>> y_true = [[0, 0, 0], [1, 1, 1], [0, 1, 1]]
>>> y_pred = [[0, 0, 0], [1, 1, 1], [1, 1, 0]]
>>> f1_score(y_true, y_pred, average=None)
array([0.66666667, 1.        , 0.66666667])


Notes

When true positive + false positive == 0, precision is undefined. When true positive + false negative == 0, recall is undefined. In such cases, by default the metric will be set to 0, as will f-score, and UndefinedMetricWarning will be raised. This behavior can be modified with zero_division.
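As a rough illustration of the formula and the zero-division note above, the sketch below computes binary F1 directly from counts with NumPy. It relies on the identity 2 * precision * recall / (precision + recall) = 2 * TP / (2 * TP + FP + FN). The helper name `binary_f1` is hypothetical and not part of scikit-learn, and unlike the library function it raises no warning.

```python
import numpy as np


def binary_f1(y_true, y_pred, zero_division=0.0):
    """Binary F1 for the positive class (label 1), computed from raw counts.

    Hypothetical helper for illustration; not scikit-learn's implementation.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true != 1) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred != 1)))
    denom = 2 * tp + fp + fn
    if denom == 0:
        # No positive labels and no positive predictions: F1 is undefined.
        return float(zero_division)
    return 2.0 * tp / denom


y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
score = binary_f1(y_true, y_pred)  # tp=3, fp=1, fn=1
all_negative = binary_f1([0, 0, 0], [0, 0, 0], zero_division=1.0)
print(score, all_negative)
```

Here precision and recall are both 3/4, so the harmonic mean is also 3/4; the all-negative input falls back to the chosen `zero_division` value.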

roc_auc_score(y_true, y_score, *, average='macro', sample_weight=None, max_fpr=None, multi_class='raise', labels=None)[source]

Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.

Note: this implementation can be used with binary, multiclass and multilabel classification, but some restrictions apply (see Parameters).

Read more in the User Guide.

Parameters
• y_true (array-like of shape (n_samples,) or (n_samples, n_classes)) – True labels or binary label indicators. The binary and multiclass cases expect labels with shape (n_samples,) while the multilabel case expects binary label indicators with shape (n_samples, n_classes).

• y_score (array-like of shape (n_samples,) or (n_samples, n_classes)) –

Target scores.

• In the binary case, it corresponds to an array of shape (n_samples,). Both probability estimates and non-thresholded decision values can be provided. The probability estimates correspond to the probability of the class with the greater label, i.e. estimator.classes_[1] and thus estimator.predict_proba(X)[:, 1]. The decision values correspond to the output of estimator.decision_function(X). See more information in the User guide;

• In the multiclass case, it corresponds to an array of shape (n_samples, n_classes) of probability estimates provided by the predict_proba method. The probability estimates must sum to 1 across the possible classes. In addition, the order of the class scores must correspond to the order of labels, if provided, or else to the numerical or lexicographical order of the labels in y_true. See more information in the User guide;

• In the multilabel case, it corresponds to an array of shape (n_samples, n_classes). Probability estimates are provided by the predict_proba method and the non-thresholded decision values by the decision_function method. The probability estimates correspond to the probability of the class with the greater label for each output of the classifier. See more information in the User guide.

• average ({'micro', 'macro', 'samples', 'weighted'} or None, default='macro') –

If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data. Note: multiclass ROC AUC currently only handles the ‘macro’ and ‘weighted’ averages.

'micro':

Calculate metrics globally by considering each element of the label indicator matrix as a label.

'macro':

Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

'weighted':

Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label).

'samples':

Calculate metrics for each instance, and find their average.

Will be ignored when y_true is binary.

• sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

• max_fpr (float > 0 and <= 1, default=None) – If not None, the standardized partial AUC [2] over the range [0, max_fpr] is returned. For the multiclass case, max_fpr should be either None or 1.0, as partial ROC AUC computation is currently not supported for multiclass.

• multi_class ({'raise', 'ovr', 'ovo'}, default='raise') –

Only used for multiclass targets. Determines the type of configuration to use. The default value raises an error, so either 'ovr' or 'ovo' must be passed explicitly.

'ovr':

Stands for One-vs-rest. Computes the AUC of each class against the rest [3] [4]. This treats the multiclass case in the same way as the multilabel case. Sensitive to class imbalance even when average == 'macro', because class imbalance affects the composition of each of the ‘rest’ groupings.

'ovo':

Stands for One-vs-one. Computes the average AUC of all possible pairwise combinations of classes [5]. Insensitive to class imbalance when average == 'macro'.

• labels (array-like of shape (n_classes,), default=None) – Only used for multiclass targets. List of labels that index the classes in y_score. If None, the numerical or lexicographical order of the labels in y_true is used.

Returns

auc

Return type

float

References

1

Wikipedia entry for the Receiver operating characteristic

2

Analyzing a portion of the ROC curve. McClish, 1989

3

Provost, F., Domingos, P. (2000). Well-trained PETs: Improving probability estimation trees (Section 6.2), CeDER Working Paper #IS-00-04, Stern School of Business, New York University.

4

Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861-874.

5

Hand, D.J., Till, R.J. (2001). A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. Machine Learning, 45(2), 171-186.

See also

average_precision_score

Area under the precision-recall curve.

roc_curve

Compute Receiver operating characteristic (ROC) curve.

RocCurveDisplay.from_estimator

Plot Receiver Operating Characteristic (ROC) curve given an estimator and some data.

RocCurveDisplay.from_predictions

Plot Receiver Operating Characteristic (ROC) curve given the true and predicted values.

Examples

Binary case:

>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.metrics import roc_auc_score
>>> X, y = load_breast_cancer(return_X_y=True)
>>> clf = LogisticRegression(solver="liblinear", random_state=0).fit(X, y)
>>> roc_auc_score(y, clf.predict_proba(X)[:, 1])
0.99...
>>> roc_auc_score(y, clf.decision_function(X))
0.99...


Multiclass case:

>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> clf = LogisticRegression(solver="liblinear").fit(X, y)
>>> roc_auc_score(y, clf.predict_proba(X), multi_class='ovr')
0.99...


Multilabel case:

>>> import numpy as np
>>> from sklearn.datasets import make_multilabel_classification
>>> from sklearn.multioutput import MultiOutputClassifier
>>> X, y = make_multilabel_classification(random_state=0)
>>> clf = MultiOutputClassifier(clf).fit(X, y)
>>> # get a list of n_output containing probability arrays of shape
>>> # (n_samples, n_classes)
>>> y_pred = clf.predict_proba(X)
>>> # extract the positive columns for each output
>>> y_pred = np.transpose([pred[:, 1] for pred in y_pred])
>>> roc_auc_score(y, y_pred, average=None)
array([0.82..., 0.86..., 0.94..., 0.85... , 0.94...])
>>> from sklearn.linear_model import RidgeClassifierCV
>>> clf = RidgeClassifierCV().fit(X, y)
>>> roc_auc_score(y, clf.decision_function(X), average=None)
array([0.81..., 0.84... , 0.93..., 0.87..., 0.94...])
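To make the binary and one-vs-rest descriptions more concrete, here is a minimal NumPy sketch that treats AUC as the fraction of (positive, negative) pairs ranked correctly, counting ties as one half. The function names `binary_auc` and `ovr_macro_auc` are hypothetical. The sketch assumes class labels 0..n_classes-1 that index the columns of `y_score`, and that every class has at least one positive and one negative sample. It is quadratic in the number of samples, so it is for illustration only and is not how scikit-learn computes the score.

```python
import numpy as np


def binary_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # diff[i, j] > 0 means positive i is scored above negative j.
    diff = pos[:, None] - neg[None, :]
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)


def ovr_macro_auc(y_true, y_score):
    """Unweighted mean of one-vs-rest AUCs, one per class column."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    aucs = [
        binary_auc((y_true == c).astype(int), y_score[:, c])
        for c in range(y_score.shape[1])
    ]
    return float(np.mean(aucs))


auc = binary_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

y_true_mc = [0, 1, 2, 2]
y_score_mc = [[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6],
              [0.5, 0.3, 0.2]]
macro_auc = ovr_macro_auc(y_true_mc, y_score_mc)
print(auc, macro_auc)
```

In the binary example, three of the four positive/negative pairs are ordered correctly, giving 0.75.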

accuracy_score(y_true, y_pred, *, normalize=True, sample_weight=None)[source]

Accuracy classification score.

In multilabel classification, this function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.

Read more in the User Guide.

Parameters
• y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) labels.

• y_pred (1d array-like, or label indicator array / sparse matrix) – Predicted labels, as returned by a classifier.

• normalize (bool, default=True) – If False, return the number of correctly classified samples. Otherwise, return the fraction of correctly classified samples.

• sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns

score – If normalize == True, return the fraction of correctly classified samples (float), else returns the number of correctly classified samples (int).

The best performance is 1 with normalize == True and the number of samples with normalize == False.

Return type

float

See also

balanced_accuracy_score

Compute the balanced accuracy to deal with imbalanced datasets.

jaccard_score

Compute the Jaccard similarity coefficient score.

hamming_loss

Compute the average Hamming loss or Hamming distance between two sets of samples.

zero_one_loss

Compute the Zero-one classification loss. By default, the function will return the percentage of imperfectly predicted subsets.

Notes

In binary classification, this function generally differs from jaccard_score: accuracy counts true negatives as correct predictions, whereas the Jaccard score ignores them.

Examples

>>> from sklearn.metrics import accuracy_score
>>> y_pred = [0, 2, 1, 3]
>>> y_true = [0, 1, 2, 3]
>>> accuracy_score(y_true, y_pred)
0.5
>>> accuracy_score(y_true, y_pred, normalize=False)
2


In the multilabel case with binary label indicators:

>>> import numpy as np
>>> accuracy_score(np.array([[0, 1], [1, 1]]), np.ones((2, 2)))
0.5
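The subset-accuracy rule for multilabel inputs can be sketched with plain NumPy, reusing the arrays from the example above. The variable names are illustrative only.

```python
import numpy as np

y_true = np.array([[0, 1], [1, 1]])
y_pred = np.ones((2, 2))

# A sample counts as correct only if every label in its row matches.
exact_match = np.all(y_true == y_pred, axis=1)
subset_accuracy = float(exact_match.mean())  # analogue of normalize=True
n_correct = int(exact_match.sum())           # analogue of normalize=False
print(subset_accuracy, n_correct)
```

Only the second row matches exactly, so one of the two samples is counted as correct.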

balanced_accuracy_score(y_true, y_pred, *, sample_weight=None, adjusted=False)[source]

Compute the balanced accuracy.

Balanced accuracy is used in binary and multiclass classification problems to deal with imbalanced datasets. It is defined as the average of recall obtained on each class.

The best value is 1 and the worst value is 0 when adjusted=False.

Read more in the User Guide.

New in version 0.20.

Parameters
• y_true (1d array-like) – Ground truth (correct) target values.

• y_pred (1d array-like) – Estimated targets as returned by a classifier.

• sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

• adjusted (bool, default=False) – When true, the result is adjusted for chance, so that random performance would score 0, while keeping perfect performance at a score of 1.

Returns

balanced_accuracy – Balanced accuracy score.

Return type

float

See also

average_precision_score

Compute average precision (AP) from prediction scores.

precision_score

Compute the precision score.

recall_score

Compute the recall score.

roc_auc_score

Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.

Notes

Some literature promotes alternative definitions of balanced accuracy. Our definition is equivalent to accuracy_score() with class-balanced sample weights, and shares desirable properties with the binary case. See the User Guide.

References

1

Brodersen, K.H.; Ong, C.S.; Stephan, K.E.; Buhmann, J.M. (2010). The balanced accuracy and its posterior distribution. Proceedings of the 20th International Conference on Pattern Recognition, 3121-24.

2

John D. Kelleher, Brian Mac Namee, Aoife D’Arcy (2015). Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies.

Examples

>>> from sklearn.metrics import balanced_accuracy_score
>>> y_true = [0, 1, 0, 0, 1, 0]
>>> y_pred = [0, 1, 0, 0, 0, 1]
>>> balanced_accuracy_score(y_true, y_pred)
0.625
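The definition as the mean of per-class recall, and the chance adjustment described for `adjusted`, can be checked by hand with a short NumPy sketch. The helper name `balanced_accuracy` is hypothetical, and the sketch ignores sample weights.

```python
import numpy as np


def balanced_accuracy(y_true, y_pred, adjusted=False):
    """Mean of per-class recall, optionally rescaled so chance scores 0."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_true)
    # Recall of class c: fraction of true-c samples predicted as c.
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    score = float(np.mean(recalls))
    if adjusted:
        chance = 1.0 / len(classes)
        score = (score - chance) / (1.0 - chance)
    return score


y_true = [0, 1, 0, 0, 1, 0]
y_pred = [0, 1, 0, 0, 0, 1]
plain = balanced_accuracy(y_true, y_pred)              # (3/4 + 1/2) / 2
rescaled = balanced_accuracy(y_true, y_pred, adjusted=True)
print(plain, rescaled)
```

With two classes, chance level is 0.5, so the adjusted score is (0.625 - 0.5) / 0.5.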

top_k_accuracy_score(y_true, y_score, *, k=2, normalize=True, sample_weight=None, labels=None)[source]

Top-k Accuracy classification score.

This metric computes the number of times where the correct label is among the top k labels predicted (ranked by predicted scores). Note that the multilabel case isn’t covered here.

Read more in the User Guide.

Parameters
• y_true (array-like of shape (n_samples,)) – True labels.

• y_score (array-like of shape (n_samples,) or (n_samples, n_classes)) – Target scores. These can be either probability estimates or non-thresholded decision values (as returned by decision_function on some classifiers). The binary case expects scores with shape (n_samples,) while the multiclass case expects scores with shape (n_samples, n_classes). In the multiclass case, the order of the class scores must correspond to the order of labels, if provided, or else to the numerical or lexicographical order of the labels in y_true. If y_true does not contain all the labels, labels must be provided.

• k (int, default=2) – Number of most likely outcomes considered to find the correct label.

• normalize (bool, default=True) – If True, return the fraction of correctly classified samples. Otherwise, return the number of correctly classified samples.

• sample_weight (array-like of shape (n_samples,), default=None) – Sample weights. If None, all samples are given the same weight.

• labels (array-like of shape (n_classes,), default=None) – Multiclass only. List of labels that index the classes in y_score. If None, the numerical or lexicographical order of the labels in y_true is used. If y_true does not contain all the labels, labels must be provided.

Returns

score – The top-k accuracy score. The best performance is 1 with normalize == True and the number of samples with normalize == False.

Return type

float

Notes

In cases where two or more labels are assigned equal predicted scores, the labels with the highest indices will be chosen first. This might impact the result if the correct label falls after the threshold because of that.

Examples

>>> import numpy as np
>>> from sklearn.metrics import top_k_accuracy_score
>>> y_true = np.array([0, 1, 2, 2])
>>> y_score = np.array([[0.5, 0.2, 0.2],  # 0 is in top 2
...                     [0.3, 0.4, 0.2],  # 1 is in top 2
...                     [0.2, 0.4, 0.3],  # 2 is in top 2
...                     [0.7, 0.2, 0.1]]) # 2 isn't in top 2
>>> top_k_accuracy_score(y_true, y_score, k=2)
0.75
>>> # Not normalizing gives the number of "correctly" classified samples
>>> top_k_accuracy_score(y_true, y_score, k=2, normalize=False)
3
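The ranking step can be sketched in NumPy as follows. The helper `top_k_accuracy` is a hypothetical name; it assumes the labels are integer column indices into `y_score`, and its tie-breaking follows `np.argsort`, which may differ from the behaviour described in the Notes.

```python
import numpy as np


def top_k_accuracy(y_true, y_score, k=2):
    """Fraction of samples whose true label is among the k highest scores."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    # Column indices of the k highest-scoring classes for each sample.
    top_k = np.argsort(-y_score, axis=1)[:, :k]
    hits = np.any(top_k == y_true[:, None], axis=1)
    return float(hits.mean())


y_true = np.array([0, 1, 2, 2])
y_score = np.array([[0.5, 0.2, 0.2],
                    [0.3, 0.4, 0.2],
                    [0.2, 0.4, 0.3],
                    [0.7, 0.2, 0.1]])
top2 = top_k_accuracy(y_true, y_score, k=2)
top1 = top_k_accuracy(y_true, y_score, k=1)
print(top2, top1)
```

With k=1 the metric reduces to ordinary accuracy of the argmax prediction.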

pearson_r2_score(y: numpy.ndarray, y_pred: numpy.ndarray) → float[source]

Computes Pearson R^2 (square of Pearson correlation).

Parameters
• y (np.ndarray) – ground truth array

• y_pred (np.ndarray) – predicted array

Returns

The Pearson-R^2 score.

Return type

float
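As a hedged illustration, the squared Pearson correlation can be computed with `np.corrcoef`. This is a sketch of the quantity described above, not necessarily how DeepChem implements it. Note that it measures linear association only, so predictions that are off by a constant scale or offset still score 1.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

# np.corrcoef returns the 2x2 correlation matrix; entry [0, 1] is Pearson r.
r = np.corrcoef(y, y_pred)[0, 1]
r2 = float(r ** 2)

# A linear transform of the labels is perfectly correlated with them,
# even though the predicted values themselves are far from the labels.
r2_shifted = float(np.corrcoef(y, 2.0 * y + 5.0)[0, 1] ** 2)
print(r2, r2_shifted)
```

For this reason the score is not interchangeable with the coefficient of determination, which penalizes such systematic errors.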

jaccard_index(y: numpy.ndarray, y_pred: numpy.ndarray) → float[source]

Computes the Jaccard index, also known as Intersection over Union (IoU), a metric commonly used in image segmentation tasks.

DEPRECATED: WILL BE REMOVED IN A FUTURE VERSION OF DEEPCHEM. USE jaccard_score instead.

Parameters
• y (np.ndarray) – ground truth array

• y_pred (np.ndarray) – predicted array

Returns

score – The jaccard index. A number between 0 and 1.

Return type

float

pixel_error(y: numpy.ndarray, y_pred: numpy.ndarray) → float[source]

An error metric in case y, y_pred are images.

Defined as 1 - the maximal F-score of pixel similarity, or squared Euclidean distance between the original and the result labels.

Parameters
• y (np.ndarray) – ground truth array

• y_pred (np.ndarray) – predicted array

Returns

score – The pixel-error. A number between 0 and 1.

Return type

float

prc_auc_score(y: numpy.ndarray, y_pred: numpy.ndarray) → float[source]

Compute the area under the precision-recall curve.

Parameters
• y (np.ndarray) – A numpy array of shape (N, n_classes) or (N,) with true labels

• y_pred (np.ndarray) – Of shape (N, n_classes) with class probabilities.

Returns

The area under the precision-recall curve. A number between 0 and 1.

Return type

float

rms_score(y_true: numpy.ndarray, y_pred: numpy.ndarray) → float[source]

Computes the root-mean-square (RMS) error.

mae_score(y_true: numpy.ndarray, y_pred: numpy.ndarray) → float[source]

Computes the mean absolute error (MAE).
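For reference, both error measures can be written out in a few lines of NumPy. This is a sketch of the standard definitions, with illustrative variable names, rather than DeepChem's own code.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 5.0])

errors = y_pred - y_true
rms = float(np.sqrt(np.mean(errors ** 2)))  # root of the mean squared error
mae = float(np.mean(np.abs(errors)))        # mean of the absolute errors
print(rms, mae)
```

Because the errors are squared before averaging, the single large error weighs more heavily in the RMS value than in the MAE.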

kappa_score(y1, y2, *, labels=None, weights=None, sample_weight=None)[source]

Cohen’s kappa: a statistic that measures inter-annotator agreement.

This function computes Cohen’s kappa [1], a score that expresses the level of agreement between two annotators on a classification problem. It is defined as

$$\kappa = (p_o - p_e) / (1 - p_e)$$

where $p_o$ is the empirical probability of agreement on the label assigned to any sample (the observed agreement ratio), and $p_e$ is the expected agreement when both annotators assign labels randomly. $p_e$ is estimated using a per-annotator empirical prior over the class labels [2].

Read more in the User Guide.

Parameters
• y1 (array of shape (n_samples,)) – Labels assigned by the first annotator.

• y2 (array of shape (n_samples,)) – Labels assigned by the second annotator. The kappa statistic is symmetric, so swapping y1 and y2 doesn’t change the value.

• labels (array-like of shape (n_classes,), default=None) – List of labels to index the matrix. This may be used to select a subset of labels. If None, all labels that appear at least once in y1 or y2 are used.

• weights ({'linear', 'quadratic'}, default=None) – Weighting type to calculate the score. None means unweighted; “linear” means linearly weighted; “quadratic” means quadratically weighted.

• sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns

kappa – The kappa statistic, which is a number between -1 and 1. The maximum value means complete agreement; zero or lower means chance agreement.

Return type

float
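The definition of kappa in terms of observed and expected agreement can be worked through with a small NumPy sketch. The helper `cohen_kappa` is a hypothetical name that covers only the unweighted case without sample weights, and it does not guard against the degenerate case where the expected agreement equals 1.

```python
import numpy as np


def cohen_kappa(y1, y2):
    """Unweighted Cohen's kappa from observed and expected agreement."""
    y1 = np.asarray(y1)
    y2 = np.asarray(y2)
    labels = np.unique(np.concatenate([y1, y2]))
    # Observed agreement: fraction of samples labelled identically.
    p_o = np.mean(y1 == y2)
    # Expected agreement: product of per-annotator label frequencies.
    p_e = sum(np.mean(y1 == c) * np.mean(y2 == c) for c in labels)
    return float((p_o - p_e) / (1.0 - p_e))


y1 = [0, 0, 1, 1, 0, 1]
y2 = [0, 0, 1, 0, 0, 1]
kappa = cohen_kappa(y1, y2)  # p_o = 5/6, p_e = 1/2
print(kappa)
```

Swapping the two annotators leaves both agreement terms unchanged, which is why the statistic is symmetric.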

References

1

J. Cohen (1960). “A coefficient of agreement for nominal scales”. Educational and Psychological Measurement 20(1):37-46. doi:10.1177/001316446002000104.

2

3

bedroc_score(y_true: numpy.ndarray, y_pred: numpy.ndarray, alpha: float = 20.0)[source]

Compute BEDROC metric.

BEDROC metric implemented according to Truchon and Bayly, which modifies the ROC score by allowing for a factor of early recognition. See [1] for details.

Parameters
• y_true (np.ndarray) – Binary class labels. 1 for positive class, 0 otherwise

• y_pred (np.ndarray) – Predicted labels

• alpha (float, default 20.0) – Early recognition parameter

Returns

Value in [0, 1] that indicates the degree of early recognition

Return type

float

Notes

This function requires RDKit to be installed.

References

1

Truchon et al. “Evaluating virtual screening methods: good and bad metrics for the ‘early recognition’ problem.” Journal of chemical information and modeling 47.2 (2007): 488-508.

concordance_index(y_true: numpy.ndarray, y_pred: numpy.ndarray) → float[source]

Compute the concordance index.

A statistical metric that indicates the quality of the predicted ranking. See [1] for details.

Parameters
• y_true (np.ndarray) – Continuous ground-truth values

• y_pred (np.ndarray) – Predicted values

Returns

A score in the range [0, 1].

Return type

float
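One common way to define the concordance index is as the fraction of comparable pairs whose predicted order agrees with the true order. The sketch below follows that definition with a simple quadratic loop. The name `concordance_index_sketch` is hypothetical, and the tie handling (skipping pairs with equal ground truth, counting tied predictions as one half) is an assumption that may differ from DeepChem's implementation.

```python
import numpy as np


def concordance_index_sketch(y_true, y_pred):
    """Fraction of comparable pairs ranked in the same order by y_pred."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    concordant = 0.0
    comparable = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue  # tied ground truth: pair is not comparable
            comparable += 1
            true_order = np.sign(y_true[i] - y_true[j])
            pred_order = np.sign(y_pred[i] - y_pred[j])
            if pred_order == 0:
                concordant += 0.5  # tied predictions count as half
            elif pred_order == true_order:
                concordant += 1.0
    return concordant / comparable


ci = concordance_index_sketch([1, 2, 3, 4], [1, 3, 2, 4])
ci_reversed = concordance_index_sketch([1, 2, 3, 4], [4, 3, 2, 1])
print(ci, ci_reversed)
```

In the first call only the middle pair is swapped, so five of the six pairs are concordant.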

References

1

Steck, Harald, et al. “On ranking in survival analysis: Bounds on the concordance index.” Advances in neural information processing systems (2008): 1209-1216.

get_motif_scores(encoded_sequences: numpy.ndarray, motif_names: List[str], max_scores: Optional[int] = None, return_positions: bool = False, GC_fraction: float = 0.4) → numpy.ndarray[source]

Computes position weight matrix (PWM) log odds.

Parameters
• encoded_sequences (np.ndarray) – A numpy array of shape (N_sequences, N_letters, sequence_length, 1).

• motif_names (List[str]) – List of motif file names.

• max_scores (int, optional) – Get top max_scores scores.

• return_positions (bool, default False) – Whether to return positions or not.

• GC_fraction (float, default 0.4) – GC fraction in background sequence.

Returns

A numpy array of scores. By default, its shape is (N_sequences, num_motifs, seq_length). If max_scores is set, its shape is (N_sequences, num_motifs*max_scores). If max_scores is set and return_positions is True, the array holds the top scores and their positions, and its shape is (N_sequences, 2*num_motifs*max_scores).

Return type

np.ndarray

Notes

This method requires simdna to be installed.

get_pssm_scores(encoded_sequences: numpy.ndarray, pssm: numpy.ndarray) → numpy.ndarray[source]

Convolves pssm and its reverse complement with encoded sequences and returns the maximum score at each position of each sequence.

Parameters
• encoded_sequences (np.ndarray) – A numpy array of shape (N_sequences, N_letters, sequence_length, 1).

• pssm (np.ndarray) – A numpy array of shape (4, pssm_length).

Returns

scores – A numpy array of shape (N_sequences, sequence_length).

Return type

np.ndarray

in_silico_mutagenesis(model: deepchem.models.models.Model, encoded_sequences: numpy.ndarray) → numpy.ndarray[source]

Computes in-silico-mutagenesis scores.

Parameters
• model (Model) – This can be any model that accepts inputs of the required shape and produces an output of shape (N_sequences, N_tasks).

• encoded_sequences (np.ndarray) – A numpy array of shape (N_sequences, N_letters, sequence_length, 1)

Returns

A numpy array of ISM scores. The shape is (num_task, N_sequences, N_letters, sequence_length, 1).

Return type

np.ndarray

## Metric Class¶

The dc.metrics.Metric class is a wrapper around metric functions which interoperates with DeepChem dc.models.Model.

class Metric(metric: Callable[[], float], task_averager: Optional[Callable[[], Any]] = None, name: Optional[str] = None, threshold: Optional[float] = None, mode: Optional[str] = None, n_tasks: Optional[int] = None, classification_handling_mode: Optional[str] = None, threshold_value: Optional[float] = None)[source]

Wrapper class for computing user-defined metrics.

The Metric class provides a wrapper for standardizing the API around different classes of metrics that may be useful for DeepChem models. The implementation provides a few non-standard conveniences such as built-in support for multitask and multiclass metrics.

This class aims to support a variety of metrics. It supports classification and regression metrics that assume the values being compared are scalars.

At present, this class doesn’t support metric computation on models which don’t present scalar outputs. For example, if you have a generative model which predicts images or molecules, you will need to write a custom evaluation and metric setup.

__init__(metric: Callable[[], float], task_averager: Optional[Callable[[], Any]] = None, name: Optional[str] = None, threshold: Optional[float] = None, mode: Optional[str] = None, n_tasks: Optional[int] = None, classification_handling_mode: Optional[str] = None, threshold_value: Optional[float] = None)[source]
Parameters
• metric (function) – Function that takes args y_true, y_pred (in that order) and computes desired score. If sample weights are to be considered, metric may take in an additional keyword argument sample_weight.

• task_averager (function, default None) – If not None, should be a function that averages metrics across tasks.

• name (str, default None) – Name of this metric

• threshold (float, default None (DEPRECATED)) – Used for binary metrics and is the threshold for the positive class.

• mode (str, default None) – Should usually be “classification” or “regression.”

• n_tasks (int, default None) – The number of tasks this class is expected to handle.

• classification_handling_mode (str, default None) –

DeepChem models by default predict class probabilities for classification problems. This means that for a given singletask prediction, after shape normalization, the DeepChem labels and prediction will be numpy arrays of shape (n_samples, n_tasks, n_classes) with class probabilities. classification_handling_mode is a string that instructs this method how to handle transforming these probabilities. It can take on the following values:

• “direct”: Pass y_true and y_pred directly into self.metric.

• “threshold”: Use threshold_predictions to threshold y_true and y_pred, with threshold_value as the desired threshold. This converts them into arrays of shape (n_samples, n_tasks), where each element is a class index.

• “threshold-one-hot”: Use threshold_predictions to threshold y_true and y_pred using threshold_value, then apply to_one_hot to the output.

• None: Select a mode automatically based on the metric.

• threshold_value (float, default None) – If set, and classification_handling_mode is “threshold” or “threshold-one-hot”, apply a thresholding operation to values with this threshold. This option is only sensible on binary classification tasks. For multiclass problems, or if threshold_value is None, argmax() is used to select the highest probability class for each task.
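The difference between the “threshold” and “threshold-one-hot” modes can be pictured with a small NumPy sketch for a single task. The helpers `threshold_probs` and `one_hot` are hypothetical stand-ins written for this example, not DeepChem's `threshold_predictions` or `to_one_hot`. In particular, comparing the positive-class probability with a strict `>` is an assumption.

```python
import numpy as np


def threshold_probs(y_prob, threshold_value=None):
    """Map (n_samples, n_classes) probabilities to (n_samples,) class indices."""
    y_prob = np.asarray(y_prob, dtype=float)
    if threshold_value is None or y_prob.shape[1] != 2:
        # Multiclass, or no threshold given: pick the most probable class.
        return np.argmax(y_prob, axis=1)
    # Binary case: predict class 1 when its probability exceeds the threshold.
    return (y_prob[:, 1] > threshold_value).astype(int)


def one_hot(labels, n_classes):
    """Map (n_samples,) class indices to (n_samples, n_classes) one-hot rows."""
    return np.eye(n_classes)[labels]


y_prob = np.array([[0.9, 0.1],
                   [0.6, 0.4],
                   [0.2, 0.8]])
by_argmax = threshold_probs(y_prob)
by_threshold = threshold_probs(y_prob, threshold_value=0.3)
as_one_hot = one_hot(by_threshold, n_classes=2)
print(by_argmax, by_threshold, as_one_hot, sep="\n")
```

Lowering the threshold to 0.3 flips the second sample to the positive class, which argmax alone would not do.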

compute_metric(y_true: ArrayLike, y_pred: ArrayLike, w: Optional[ArrayLike] = None, n_tasks: Optional[int] = None, n_classes: int = 2, per_task_metrics: bool = False, use_sample_weights: bool = False, **kwargs) → Any[source]

Compute a performance metric for each task.

Parameters
• y_true (ArrayLike) – An ArrayLike containing true values for each task. For a classification metric, must be of shape (N,) or (N, n_tasks) or (N, n_tasks, n_classes). If of shape (N, n_tasks), values can either be class labels or probabilities of the positive class for binary classification problems. For a regression metric, must be of shape (N,) or (N, n_tasks) or (N, n_tasks, 1).

• y_pred (ArrayLike) – An ArrayLike containing predicted values for each task. Must be of shape (N, n_tasks, n_classes) if a classification metric, else must be of shape (N, n_tasks) if a regression metric.

• w (ArrayLike, default None) – An ArrayLike containing weights for each datapoint. If specified, must be of shape (N, n_tasks).

• n_tasks (int, default None) – The number of tasks this class is expected to handle.

• n_classes (int, default 2) – Number of classes in data for classification tasks.

• per_task_metrics (bool, default False) – If true, return computed metric for each task on multitask dataset.

• use_sample_weights (bool, default False) – If set, use per-sample weights w.

• kwargs (dict) – Will be passed on to self.metric

Returns

A numpy array containing metric values for each task.

Return type

np.ndarray
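Conceptually, a multitask metric applies the wrapped function to each task column and then combines the results with the task averager. The NumPy sketch below illustrates that idea for a regression metric. The function `multitask_metric` is hypothetical, it skips the shape normalization and weighting described above, and returning an (average, per-task list) pair when `per_task_metrics` is set is an assumption of this example.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error for a single task."""
    return float(np.mean(np.abs(y_true - y_pred)))


def multitask_metric(metric, y_true, y_pred, task_averager=np.mean,
                     per_task_metrics=False):
    """Apply `metric` to each task column, then average across tasks."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n_tasks = y_true.shape[1]
    per_task = [metric(y_true[:, t], y_pred[:, t]) for t in range(n_tasks)]
    averaged = float(task_averager(per_task))
    if per_task_metrics:
        return averaged, per_task
    return averaged


# Two regression tasks, three samples each: arrays of shape (N, n_tasks).
y_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
y_pred = np.array([[1.0, 12.0], [2.0, 20.0], [6.0, 34.0]])
mean_mae, task_maes = multitask_metric(mae, y_true, y_pred,
                                       per_task_metrics=True)
print(mean_mae, task_maes)
```

Any callable that reduces a list of per-task scores to one number, such as `np.median`, could be passed as the averager in this sketch.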

compute_singletask_metric(y_true: ArrayLike, y_pred: ArrayLike, w: Optional[ArrayLike] = None, n_samples: Optional[int] = None, use_sample_weights: bool = False, **kwargs) → float[source]

Compute a metric value.

Parameters
• y_true (ArrayLike) – True values array. This array must be of shape (N, n_classes) if classification and (N,) if regression.

• y_pred (ArrayLike) – Predictions array. This array must be of shape (N, n_classes) if classification and (N,) if regression.

• w (ArrayLike, default None) – Sample weight array. This array must be of shape (N,)

• n_samples (int, default None (DEPRECATED)) – The number of samples in the dataset. This is N. This argument is ignored.

• use_sample_weights (bool, default False) – If set, use per-sample weights w.

• kwargs (dict) – Will be passed on to self.metric

Returns

metric_value – The computed value of the metric.

Return type

float