Metrics

Metrics are one of the most important parts of machine learning. Unlike traditional software, in which algorithms either work or don’t work, machine learning models work in degrees. That is, there’s a continuous range of “goodness” for a model. “Metrics” are functions which measure how well a model works. There are many different choices of metrics depending on the type of model at hand.

Metric Utilities

Metric utility functions allow for some common manipulations such as switching to/from one-hot representations.

to_one_hot(y: ndarray, n_classes: int = 2) ndarray[source]

Transforms label vector into one-hot encoding.

Turns y into an array of shape (N, n_classes) with a one-hot encoding. Assumes that y takes values from 0 to n_classes - 1.

Parameters:
  • y (np.ndarray) – A vector of shape (N,) or (N, 1)

  • n_classes (int, default 2) – The number of classes to use for the one-hot encoding. Should satisfy n_classes >= max(y) + 1.

Returns:

A numpy array of shape (N, n_classes).

Return type:

np.ndarray

from_one_hot(y: ndarray, axis: int = 1) ndarray[source]

Transforms label vector from one-hot encoding.

Parameters:
  • y (np.ndarray) – A vector of shape (n_samples, num_classes)

  • axis (int, optional (default 1)) – The axis with one-hot encodings to reduce on.

Returns:

A numpy array of shape (n_samples,)

Return type:

np.ndarray
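A minimal round-trip sketch (assuming DeepChem is installed and these utilities are imported from deepchem.metrics):

>>> import numpy as np
>>> from deepchem.metrics import to_one_hot, from_one_hot  # assumes DeepChem is installed
>>> y = np.array([0, 1, 1, 0])
>>> y_hot = to_one_hot(y, n_classes=2)
>>> y_hot.shape
(4, 2)
>>> from_one_hot(y_hot).tolist()
[0, 1, 1, 0]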

Metric Shape Handling

One of the trickiest parts of handling metrics correctly is making sure that the shapes of input weights, predictions, and labels are processed correctly. This is particularly challenging since DeepChem supports multitask, multiclass models, which means that shapes must be handled with care to prevent errors. DeepChem maintains the following utility functions to facilitate shape handling for you.

normalize_weight_shape(w: ndarray | None, n_samples: int, n_tasks: int) ndarray[source]

A utility function to correct the shape of the weight array.

This utility function is used to normalize the shapes of a given weight array.

Parameters:
  • w (np.ndarray) – w can be None or a scalar or a np.ndarray of shape (n_samples,) or of shape (n_samples, n_tasks). If w is a scalar, it’s assumed to be the same weight for all samples/tasks.

  • n_samples (int) – The number of samples in the dataset. If w is not None, we should have n_samples = w.shape[0] if w is a ndarray

  • n_tasks (int) – The number of tasks. If w is 2d ndarray, then we should have w.shape[1] == n_tasks.

Examples

>>> import numpy as np
>>> w_out = normalize_weight_shape(None, n_samples=10, n_tasks=1)
>>> (w_out == np.ones((10, 1))).all()
True
Returns:

w_out – Array of shape (n_samples, n_tasks)

Return type:

np.ndarray

normalize_labels_shape(y: ndarray, mode: str | None = None, n_tasks: int | None = None, n_classes: int | None = None) ndarray[source]

A utility function to correct the shape of the labels.

Parameters:
  • y (np.ndarray) – y is an array of shape (N,) or (N, n_tasks) or (N, n_tasks, 1).

  • mode (str, default None) – If mode is “classification” or “regression”, attempts to apply data transformations.

  • n_tasks (int, default None) – The number of tasks this class is expected to handle.

  • n_classes (int, default None) – If specified use this as the number of classes. Else will try to impute it as n_classes = max(y) + 1 for arrays and as n_classes=2 for the case of scalars. Note this parameter only has value if mode==”classification”

Returns:

y_out – If mode==”classification”, y_out is an array of shape (N, n_tasks, n_classes). If mode==”regression”, y_out is an array of shape (N, n_tasks).

Return type:

np.ndarray
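A shape-only sketch based on the documented return shapes (assumes normalize_labels_shape is imported from deepchem.metrics):

>>> import numpy as np
>>> y = np.array([0, 1, 1, 0])  # single-task binary labels of shape (N,)
>>> y_out = normalize_labels_shape(y, mode="classification", n_tasks=1, n_classes=2)
>>> y_out.shape
(4, 1, 2)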

normalize_prediction_shape(y: ndarray, mode: str | None = None, n_tasks: int | None = None, n_classes: int | None = None)[source]

A utility function to correct the shape of provided predictions.

The metric computation classes expect that inputs for classification have the uniform shape (N, n_tasks, n_classes) and inputs for regression have the uniform shape (N, n_tasks). This function normalizes the provided input array to have the desired shape.

Examples

>>> import numpy as np
>>> y = np.random.rand(10)
>>> y_out = normalize_prediction_shape(y, "regression", n_tasks=1)
>>> y_out.shape
(10, 1)
Parameters:
  • y (np.ndarray) – If mode==”classification”, y is an array of shape (N,) or (N, n_tasks) or (N, n_tasks, n_classes). If mode==”regression”, y is an array of shape (N,) or (N, n_tasks) or (N, n_tasks, 1).

  • mode (str, default None) – If mode is “classification” or “regression”, attempts to apply data transformations.

  • n_tasks (int, default None) – The number of tasks this class is expected to handle.

  • n_classes (int, default None) – If specified use this as the number of classes. Else will try to impute it as n_classes = max(y) + 1 for arrays and as n_classes=2 for the case of scalars. Note this parameter only has value if mode==”classification”

Returns:

y_out – If mode==”classification”, y_out is an array of shape (N, n_tasks, n_classes). If mode==”regression”, y_out is an array of shape (N, n_tasks).

Return type:

np.ndarray

handle_classification_mode(y: ndarray, classification_handling_mode: str | None, threshold_value: float | None = None) ndarray[source]

Handle classification mode.

Transform predictions so that they have the correct classification mode.

Parameters:
  • y (np.ndarray) – Must be of shape (N, n_tasks, n_classes)

  • classification_handling_mode (str, default None) –

    DeepChem models by default predict class probabilities for classification problems. This means that for a given singletask prediction, after shape normalization, the DeepChem prediction will be a numpy array of shape (N, n_classes) with class probabilities. classification_handling_mode is a string that instructs this method how to handle transforming these probabilities. It can take on the following values:

    • None: default value. Pass y_pred directly into self.metric.

    • ”threshold”: Use threshold_predictions to threshold y_pred, with threshold_value as the desired threshold.

    • ”threshold-one-hot”: Use threshold_predictions to threshold y_pred using threshold_value, then apply to_one_hot to the output.

  • threshold_value (float, default None) – If set, and classification_handling_mode is “threshold” or “threshold-one-hot”, apply a thresholding operation to values with this threshold. This option is only sensible for binary classification tasks.

Returns:

y_out – If classification_handling_mode is None, then of shape (N, n_tasks, n_classes). If classification_handling_mode is “threshold”, then of shape (N, n_tasks). If classification_handling_mode is “threshold-one-hot”, then of shape (N, n_tasks, n_classes).

Return type:

np.ndarray
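A shape-only sketch of the “threshold” mode, based on the documented return shapes (assumes handle_classification_mode is imported from deepchem.metrics):

>>> import numpy as np
>>> probs = np.array([[[0.2, 0.8]], [[0.9, 0.1]]])  # shape (N=2, n_tasks=1, n_classes=2)
>>> y_out = handle_classification_mode(probs, "threshold", threshold_value=0.5)
>>> y_out.shape
(2, 1)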

Metric Functions

DeepChem has a variety of different metrics which are useful for measuring model performance. A number (but not all) of these metrics are directly sourced from sklearn.
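In typical DeepChem usage these functions are wrapped in a dc.metrics.Metric object and passed to a model’s evaluate method. A hedged usage sketch (the model and dataset objects are assumed to already exist):

>>> import deepchem as dc
>>> metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
>>> # scores = model.evaluate(dataset, [metric])  # model and dataset assumed to exist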

matthews_corrcoef(y_true, y_pred, *, sample_weight=None)[source]

Compute the Matthews correlation coefficient (MCC).

The Matthews correlation coefficient is used in machine learning as a measure of the quality of binary and multiclass classifications. It takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes. The MCC is in essence a correlation coefficient value between -1 and +1. A coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction. The statistic is also known as the phi coefficient. [source: Wikipedia]

Binary and multiclass labels are supported. Only in the binary case does this relate to information about true and false positives and negatives. See references below.

Read more in the User Guide.

Parameters:
  • y_true (array-like of shape (n_samples,)) – Ground truth (correct) target values.

  • y_pred (array-like of shape (n_samples,)) – Estimated targets as returned by a classifier.

  • sample_weight (array-like of shape (n_samples,), default=None) –

    Sample weights.

    New in version 0.18.

Returns:

mcc – The Matthews correlation coefficient (+1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction).

Return type:

float

References

Examples

>>> from sklearn.metrics import matthews_corrcoef
>>> y_true = [+1, +1, +1, -1]
>>> y_pred = [+1, -1, +1, +1]
>>> matthews_corrcoef(y_true, y_pred)
-0.33...
recall_score(y_true, y_pred, *, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')[source]

Compute the recall.

The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.

The best value is 1 and the worst value is 0.

Support beyond binary targets is achieved by treating multiclass and multilabel data as a collection of binary problems, one for each label. For the binary case, setting average=’binary’ will return recall for pos_label. If average is not ‘binary’, pos_label is ignored and recall for both classes is computed, then averaged or both returned (when average=None). Similarly, for multiclass and multilabel targets, recall for all labels is either returned or averaged depending on the average parameter. Use labels to specify the set of labels to calculate recall for.

Read more in the User Guide.

Parameters:
  • y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) target values.

  • y_pred (1d array-like, or label indicator array / sparse matrix) – Estimated targets as returned by a classifier.

  • labels (array-like, default=None) –

    The set of labels to include when average != ‘binary’, and their order if average is None. Labels present in the data can be excluded, for example in multiclass classification to exclude a “negative class”. Labels not present in the data can be included and will be “assigned” 0 samples. For multilabel targets, labels are column indices. By default, all labels in y_true and y_pred are used in sorted order.

    Changed in version 0.17: Parameter labels improved for multiclass problem.

  • pos_label (int, float, bool or str, default=1) – The class to report if average=’binary’ and the data is binary, otherwise this parameter is ignored. For multiclass or multilabel targets, set labels=[pos_label] and average != ‘binary’ to report metrics for one label only.

  • average ({'micro', 'macro', 'samples', 'weighted', 'binary'} or None, default='binary') –

    This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:

    'binary':

    Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.

    'micro':

    Calculate metrics globally by counting the total true positives, false negatives and false positives.

    'macro':

    Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

    'weighted':

    Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall. Weighted recall is equal to accuracy.

    'samples':

    Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from accuracy_score()).

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

  • zero_division ({"warn", 0.0, 1.0, np.nan}, default="warn") –

    Sets the value to return when there is a zero division.

    Notes:

    • If set to “warn”, this acts like 0, but a warning is also raised.

    • If set to np.nan, such values will be excluded from the average.

    New in version 1.3: np.nan option was added.

Returns:

recall – Recall of the positive class in binary classification or weighted average of the recall of each class for the multiclass task.

Return type:

float (if average is not None) or array of float of shape (n_unique_labels,)

See also

precision_recall_fscore_support

Compute precision, recall, F-measure and support for each class.

precision_score

Compute the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives.

balanced_accuracy_score

Compute balanced accuracy to deal with imbalanced datasets.

multilabel_confusion_matrix

Compute a confusion matrix for each class or sample.

PrecisionRecallDisplay.from_estimator

Plot precision-recall curve given an estimator and some data.

PrecisionRecallDisplay.from_predictions

Plot precision-recall curve given binary class predictions.

Notes

When true positive + false negative == 0, recall returns 0 and raises UndefinedMetricWarning. This behavior can be modified with zero_division.

Examples

>>> import numpy as np
>>> from sklearn.metrics import recall_score
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> recall_score(y_true, y_pred, average='macro')
0.33...
>>> recall_score(y_true, y_pred, average='micro')
0.33...
>>> recall_score(y_true, y_pred, average='weighted')
0.33...
>>> recall_score(y_true, y_pred, average=None)
array([1., 0., 0.])
>>> y_true = [0, 0, 0, 0, 0, 0]
>>> recall_score(y_true, y_pred, average=None)
array([0.5, 0. , 0. ])
>>> recall_score(y_true, y_pred, average=None, zero_division=1)
array([0.5, 1. , 1. ])
>>> recall_score(y_true, y_pred, average=None, zero_division=np.nan)
array([0.5, nan, nan])
>>> # multilabel classification
>>> y_true = [[0, 0, 0], [1, 1, 1], [0, 1, 1]]
>>> y_pred = [[0, 0, 0], [1, 1, 1], [1, 1, 0]]
>>> recall_score(y_true, y_pred, average=None)
array([1. , 1. , 0.5])
r2_score(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average', force_finite=True)[source]

\(R^2\) (coefficient of determination) regression score function.

Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). In the general case when the true y is non-constant, a constant model that always predicts the average y disregarding the input features would get a \(R^2\) score of 0.0.

In the particular case when y_true is constant, the \(R^2\) score is not finite: it is either NaN (perfect predictions) or -Inf (imperfect predictions). To prevent such non-finite numbers from polluting higher-level experiments such as a grid search cross-validation, by default these cases are replaced with 1.0 (perfect predictions) or 0.0 (imperfect predictions) respectively. You can set force_finite to False to prevent this fix from happening.

Note: when the prediction residuals have zero mean, the \(R^2\) score is identical to the Explained Variance score.

Read more in the User Guide.

Parameters:
  • y_true (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Ground truth (correct) target values.

  • y_pred (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Estimated target values.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

  • multioutput ({'raw_values', 'uniform_average', 'variance_weighted'}, array-like of shape (n_outputs,) or None, default='uniform_average') –

    Defines aggregating of multiple output scores. Array-like value defines weights used to average scores. Default is “uniform_average”.

    ’raw_values’ :

    Returns a full set of scores in case of multioutput input.

    ’uniform_average’ :

    Scores of all outputs are averaged with uniform weight.

    ’variance_weighted’ :

    Scores of all outputs are averaged, weighted by the variances of each individual output.

    Changed in version 0.19: Default value of multioutput is ‘uniform_average’.

  • force_finite (bool, default=True) –

    Flag indicating if NaN and -Inf scores resulting from constant data should be replaced with real numbers (1.0 if prediction is perfect, 0.0 otherwise). Default is True, a convenient setting for hyperparameters’ search procedures (e.g. grid search cross-validation).

    New in version 1.1.

Returns:

z – The \(R^2\) score or ndarray of scores if ‘multioutput’ is ‘raw_values’.

Return type:

float or ndarray of floats

Notes

This is not a symmetric function.

Unlike most other scores, \(R^2\) score may be negative (it need not actually be the square of a quantity R).

This metric is not well-defined for single samples and will return a NaN value if n_samples is less than two.

References

Examples

>>> from sklearn.metrics import r2_score
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> r2_score(y_true, y_pred)
0.948...
>>> y_true = [[0.5, 1], [-1, 1], [7, -6]]
>>> y_pred = [[0, 2], [-1, 2], [8, -5]]
>>> r2_score(y_true, y_pred,
...          multioutput='variance_weighted')
0.938...
>>> y_true = [1, 2, 3]
>>> y_pred = [1, 2, 3]
>>> r2_score(y_true, y_pred)
1.0
>>> y_true = [1, 2, 3]
>>> y_pred = [2, 2, 2]
>>> r2_score(y_true, y_pred)
0.0
>>> y_true = [1, 2, 3]
>>> y_pred = [3, 2, 1]
>>> r2_score(y_true, y_pred)
-3.0
>>> y_true = [-2, -2, -2]
>>> y_pred = [-2, -2, -2]
>>> r2_score(y_true, y_pred)
1.0
>>> r2_score(y_true, y_pred, force_finite=False)
nan
>>> y_true = [-2, -2, -2]
>>> y_pred = [-2, -2, -2 + 1e-8]
>>> r2_score(y_true, y_pred)
0.0
>>> r2_score(y_true, y_pred, force_finite=False)
-inf
mean_squared_error(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average', squared='deprecated')[source]

Mean squared error regression loss.

Read more in the User Guide.

Parameters:
  • y_true (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Ground truth (correct) target values.

  • y_pred (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Estimated target values.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

  • multioutput ({'raw_values', 'uniform_average'} or array-like of shape (n_outputs,), default='uniform_average') –

    Defines aggregating of multiple output values. Array-like value defines weights used to average errors.

    ’raw_values’ :

    Returns a full set of errors in case of multioutput input.

    ’uniform_average’ :

    Errors of all outputs are averaged with uniform weight.

  • squared (bool, default=True) –

    If True returns MSE value, if False returns RMSE value.

    Deprecated since version 1.4: squared is deprecated in 1.4 and will be removed in 1.6. Use root_mean_squared_error() instead to calculate the root mean squared error.

Returns:

loss – A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.

Return type:

float or ndarray of floats

Examples

>>> from sklearn.metrics import mean_squared_error
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> mean_squared_error(y_true, y_pred)
0.375
>>> y_true = [[0.5, 1],[-1, 1],[7, -6]]
>>> y_pred = [[0, 2],[-1, 2],[8, -5]]
>>> mean_squared_error(y_true, y_pred)
0.708...
>>> mean_squared_error(y_true, y_pred, multioutput='raw_values')
array([0.41666667, 1.        ])
>>> mean_squared_error(y_true, y_pred, multioutput=[0.3, 0.7])
0.825...
mean_absolute_error(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average')[source]

Mean absolute error regression loss.

Read more in the User Guide.

Parameters:
  • y_true (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Ground truth (correct) target values.

  • y_pred (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Estimated target values.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

  • multioutput ({'raw_values', 'uniform_average'} or array-like of shape (n_outputs,), default='uniform_average') –

    Defines aggregating of multiple output values. Array-like value defines weights used to average errors.

    ’raw_values’ :

    Returns a full set of errors in case of multioutput input.

    ’uniform_average’ :

    Errors of all outputs are averaged with uniform weight.

Returns:

loss – If multioutput is ‘raw_values’, then mean absolute error is returned for each output separately. If multioutput is ‘uniform_average’ or an ndarray of weights, then the weighted average of all output errors is returned.

MAE output is non-negative floating point. The best value is 0.0.

Return type:

float or ndarray of floats

Examples

>>> from sklearn.metrics import mean_absolute_error
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> mean_absolute_error(y_true, y_pred)
0.5
>>> y_true = [[0.5, 1], [-1, 1], [7, -6]]
>>> y_pred = [[0, 2], [-1, 2], [8, -5]]
>>> mean_absolute_error(y_true, y_pred)
0.75
>>> mean_absolute_error(y_true, y_pred, multioutput='raw_values')
array([0.5, 1. ])
>>> mean_absolute_error(y_true, y_pred, multioutput=[0.3, 0.7])
0.85...
precision_score(y_true, y_pred, *, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')[source]

Compute the precision.

The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

The best value is 1 and the worst value is 0.

Support beyond binary targets is achieved by treating multiclass and multilabel data as a collection of binary problems, one for each label. For the binary case, setting average=’binary’ will return precision for pos_label. If average is not ‘binary’, pos_label is ignored and precision for both classes is computed, then averaged or both returned (when average=None). Similarly, for multiclass and multilabel targets, precision for all labels is either returned or averaged depending on the average parameter. Use labels to specify the set of labels to calculate precision for.

Read more in the User Guide.

Parameters:
  • y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) target values.

  • y_pred (1d array-like, or label indicator array / sparse matrix) – Estimated targets as returned by a classifier.

  • labels (array-like, default=None) –

    The set of labels to include when average != ‘binary’, and their order if average is None. Labels present in the data can be excluded, for example in multiclass classification to exclude a “negative class”. Labels not present in the data can be included and will be “assigned” 0 samples. For multilabel targets, labels are column indices. By default, all labels in y_true and y_pred are used in sorted order.

    Changed in version 0.17: Parameter labels improved for multiclass problem.

  • pos_label (int, float, bool or str, default=1) – The class to report if average=’binary’ and the data is binary, otherwise this parameter is ignored. For multiclass or multilabel targets, set labels=[pos_label] and average != ‘binary’ to report metrics for one label only.

  • average ({'micro', 'macro', 'samples', 'weighted', 'binary'} or None, default='binary') –

    This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:

    'binary':

    Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.

    'micro':

    Calculate metrics globally by counting the total true positives, false negatives and false positives.

    'macro':

    Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

    'weighted':

    Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.

    'samples':

    Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from accuracy_score()).

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

  • zero_division ({"warn", 0.0, 1.0, np.nan}, default="warn") –

    Sets the value to return when there is a zero division.

    Notes:

    • If set to “warn”, this acts like 0, but a warning is also raised.

    • If set to np.nan, such values will be excluded from the average.

    New in version 1.3: np.nan option was added.

Returns:

precision – Precision of the positive class in binary classification or weighted average of the precision of each class for the multiclass task.

Return type:

float (if average is not None) or array of float of shape (n_unique_labels,)

See also

precision_recall_fscore_support

Compute precision, recall, F-measure and support for each class.

recall_score

Compute the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives.

PrecisionRecallDisplay.from_estimator

Plot precision-recall curve given an estimator and some data.

PrecisionRecallDisplay.from_predictions

Plot precision-recall curve given binary class predictions.

multilabel_confusion_matrix

Compute a confusion matrix for each class or sample.

Notes

When true positive + false positive == 0, precision returns 0 and raises UndefinedMetricWarning. This behavior can be modified with zero_division.

Examples

>>> import numpy as np
>>> from sklearn.metrics import precision_score
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> precision_score(y_true, y_pred, average='macro')
0.22...
>>> precision_score(y_true, y_pred, average='micro')
0.33...
>>> precision_score(y_true, y_pred, average='weighted')
0.22...
>>> precision_score(y_true, y_pred, average=None)
array([0.66..., 0.        , 0.        ])
>>> y_pred = [0, 0, 0, 0, 0, 0]
>>> precision_score(y_true, y_pred, average=None)
array([0.33..., 0.        , 0.        ])
>>> precision_score(y_true, y_pred, average=None, zero_division=1)
array([0.33..., 1.        , 1.        ])
>>> precision_score(y_true, y_pred, average=None, zero_division=np.nan)
array([0.33...,        nan,        nan])
>>> # multilabel classification
>>> y_true = [[0, 0, 0], [1, 1, 1], [0, 1, 1]]
>>> y_pred = [[0, 0, 0], [1, 1, 1], [1, 1, 0]]
>>> precision_score(y_true, y_pred, average=None)
array([0.5, 1. , 1. ])
precision_recall_curve(y_true, probas_pred, *, pos_label=None, sample_weight=None, drop_intermediate=False)[source]

Compute precision-recall pairs for different probability thresholds.

Note: this implementation is restricted to the binary classification task.

The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.

The last precision and recall values are 1. and 0. respectively and do not have a corresponding threshold. This ensures that the graph starts on the y axis.

The first precision and recall values are precision=class balance and recall=1.0 which corresponds to a classifier that always predicts the positive class.

Read more in the User Guide.

Parameters:
  • y_true (array-like of shape (n_samples,)) – True binary labels. If labels are not either {-1, 1} or {0, 1}, then pos_label should be explicitly given.

  • probas_pred (array-like of shape (n_samples,)) – Target scores, can either be probability estimates of the positive class, or non-thresholded measure of decisions (as returned by decision_function on some classifiers).

  • pos_label (int, float, bool or str, default=None) – The label of the positive class. When pos_label=None, if y_true is in {-1, 1} or {0, 1}, pos_label is set to 1, otherwise an error will be raised.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

  • drop_intermediate (bool, default=False) –

    Whether to drop some suboptimal thresholds which would not appear on a plotted precision-recall curve. This is useful in order to create lighter precision-recall curves.

    New in version 1.3.

Returns:

  • precision (ndarray of shape (n_thresholds + 1,)) – Precision values such that element i is the precision of predictions with score >= thresholds[i] and the last element is 1.

  • recall (ndarray of shape (n_thresholds + 1,)) – Decreasing recall values such that element i is the recall of predictions with score >= thresholds[i] and the last element is 0.

  • thresholds (ndarray of shape (n_thresholds,)) – Increasing thresholds on the decision function used to compute precision and recall where n_thresholds = len(np.unique(probas_pred)).

See also

PrecisionRecallDisplay.from_estimator

Plot Precision Recall Curve given a binary classifier.

PrecisionRecallDisplay.from_predictions

Plot Precision Recall Curve using predictions from a binary classifier.

average_precision_score

Compute average precision from prediction scores.

det_curve

Compute error rates for different probability thresholds.

roc_curve

Compute Receiver operating characteristic (ROC) curve.

Examples

>>> import numpy as np
>>> from sklearn.metrics import precision_recall_curve
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> precision, recall, thresholds = precision_recall_curve(
...     y_true, y_scores)
>>> precision
array([0.5       , 0.66666667, 0.5       , 1.        , 1.        ])
>>> recall
array([1. , 1. , 0.5, 0.5, 0. ])
>>> thresholds
array([0.1 , 0.35, 0.4 , 0.8 ])
auc(x, y)[source]

Compute Area Under the Curve (AUC) using the trapezoidal rule.

This is a general function, given points on a curve. For computing the area under the ROC-curve, see roc_auc_score(). For an alternative way to summarize a precision-recall curve, see average_precision_score().

Parameters:
  • x (array-like of shape (n,)) – X coordinates. These must be either monotonic increasing or monotonic decreasing.

  • y (array-like of shape (n,)) – Y coordinates.

Returns:

auc – Area Under the Curve.

Return type:

float

See also

roc_auc_score

Compute the area under the ROC curve.

average_precision_score

Compute average precision from prediction scores.

precision_recall_curve

Compute precision-recall pairs for different probability thresholds.

Examples

>>> import numpy as np
>>> from sklearn import metrics
>>> y = np.array([1, 1, 2, 2])
>>> pred = np.array([0.1, 0.4, 0.35, 0.8])
>>> fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)
>>> metrics.auc(fpr, tpr)
0.75
jaccard_score(y_true, y_pred, *, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')[source]

Jaccard similarity coefficient score.

The Jaccard index [1]_, or Jaccard similarity coefficient, defined as the size of the intersection divided by the size of the union of two label sets, is used to compare the set of predicted labels for a sample to the corresponding set of labels in y_true.

Support beyond binary targets is achieved by treating multiclass and multilabel data as a collection of binary problems, one for each label. For the binary case, setting average=’binary’ will return the Jaccard similarity coefficient for pos_label. If average is not ‘binary’, pos_label is ignored and scores for both classes are computed, then averaged or both returned (when average=None). Similarly, for multiclass and multilabel targets, scores for all labels are either returned or averaged depending on the average parameter. Use labels to specify the set of labels to calculate the score for.

Read more in the User Guide.

Parameters:
  • y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) labels.

  • y_pred (1d array-like, or label indicator array / sparse matrix) – Predicted labels, as returned by a classifier.

  • labels (array-like of shape (n_classes,), default=None) – The set of labels to include when average != ‘binary’, and their order if average is None. Labels present in the data can be excluded, for example in multiclass classification to exclude a “negative class”. Labels not present in the data can be included and will be “assigned” 0 samples. For multilabel targets, labels are column indices. By default, all labels in y_true and y_pred are used in sorted order.

  • pos_label (int, float, bool or str, default=1) – The class to report if average=’binary’ and the data is binary, otherwise this parameter is ignored. For multiclass or multilabel targets, set labels=[pos_label] and average != ‘binary’ to report metrics for one label only.

  • average ({'micro', 'macro', 'samples', 'weighted', 'binary'} or None, default='binary') –

    If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:

    'binary':

    Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.

    'micro':

    Calculate metrics globally by counting the total true positives, false negatives and false positives.

    'macro':

    Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

    'weighted':

    Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance.

    'samples':

    Calculate metrics for each instance, and find their average (only meaningful for multilabel classification).

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

  • zero_division ("warn", {0.0, 1.0}, default="warn") – Sets the value to return when there is a zero division, i.e. when there are no negative values in predictions and labels. If set to “warn”, this acts like 0, but a warning is also raised.

Returns:

score – The Jaccard score. When average is not None, a single scalar is returned.

Return type:

float or ndarray of shape (n_unique_labels,), dtype=np.float64

See also

accuracy_score

Function for calculating the accuracy score.

f1_score

Function for calculating the F1 score.

multilabel_confusion_matrix

Function for computing a confusion matrix for each class or sample.

Notes

jaccard_score() may be a poor metric if there are no positives for some samples or classes. Jaccard is undefined if there are no true or predicted labels, and our implementation will return a score of 0 with a warning.

References

Examples

>>> import numpy as np
>>> from sklearn.metrics import jaccard_score
>>> y_true = np.array([[0, 1, 1],
...                    [1, 1, 0]])
>>> y_pred = np.array([[1, 1, 1],
...                    [1, 0, 0]])

In the binary case:

>>> jaccard_score(y_true[0], y_pred[0])
0.6666...

In the 2D comparison case (e.g. image similarity):

>>> jaccard_score(y_true, y_pred, average="micro")
0.6

In the multilabel case:

>>> jaccard_score(y_true, y_pred, average='samples')
0.5833...
>>> jaccard_score(y_true, y_pred, average='macro')
0.6666...
>>> jaccard_score(y_true, y_pred, average=None)
array([0.5, 0.5, 1. ])

In the multiclass case:

>>> y_pred = [0, 2, 1, 2]
>>> y_true = [0, 1, 2, 2]
>>> jaccard_score(y_true, y_pred, average=None)
array([1. , 0. , 0.33...])
f1_score(y_true, y_pred, *, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')[source]

Compute the F1 score, also known as balanced F-score or F-measure.

The F1 score can be interpreted as a harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. The formula for the F1 score is:

\[\text{F1} = \frac{2 * \text{TP}}{2 * \text{TP} + \text{FP} + \text{FN}}\]

Where \(\text{TP}\) is the number of true positives, \(\text{FN}\) is the number of false negatives, and \(\text{FP}\) is the number of false positives. F1 is by default calculated as 0.0 when there are no true positives, false negatives, or false positives.

Support beyond binary targets is achieved by treating multiclass and multilabel data as a collection of binary problems, one for each label. For the binary case, setting average=’binary’ will return the F1 score for pos_label. If average is not ‘binary’, pos_label is ignored and F1 scores for both classes are computed, then averaged or both returned (when average=None). Similarly, for multiclass and multilabel targets, F1 scores for all labels are either returned or averaged depending on the average parameter. Use labels to specify the set of labels to calculate F1 score for.

Read more in the User Guide.

Parameters:
  • y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) target values.

  • y_pred (1d array-like, or label indicator array / sparse matrix) – Estimated targets as returned by a classifier.

  • labels (array-like, default=None) –

    The set of labels to include when average != ‘binary’, and their order if average is None. Labels present in the data can be excluded, for example in multiclass classification to exclude a “negative class”. Labels not present in the data can be included and will be “assigned” 0 samples. For multilabel targets, labels are column indices. By default, all labels in y_true and y_pred are used in sorted order.

    Changed in version 0.17: Parameter labels improved for multiclass problem.

  • pos_label (int, float, bool or str, default=1) – The class to report if average=’binary’ and the data is binary, otherwise this parameter is ignored. For multiclass or multilabel targets, set labels=[pos_label] and average != ‘binary’ to report metrics for one label only.

  • average ({'micro', 'macro', 'samples', 'weighted', 'binary'} or None, default='binary') –

    This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:

    'binary':

    Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.

    'micro':

    Calculate metrics globally by counting the total true positives, false negatives and false positives.

    'macro':

    Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

    'weighted':

    Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.

    'samples':

    Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from accuracy_score()).

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

  • zero_division ({"warn", 0.0, 1.0, np.nan}, default="warn") –

    Sets the value to return when there is a zero division, i.e. when all predictions and labels are negative.

    Notes:

    • If set to “warn”, this acts like 0, but a warning is also raised.

    • If set to np.nan, such values will be excluded from the average.

    New in version 1.3: np.nan option was added.

Returns:

f1_score – F1 score of the positive class in binary classification or weighted average of the F1 scores of each class for the multiclass task.

Return type:

float or array of float, shape = [n_unique_labels]

See also

fbeta_score

Compute the F-beta score.

precision_recall_fscore_support

Compute the precision, recall, F-score, and support.

jaccard_score

Compute the Jaccard similarity coefficient score.

multilabel_confusion_matrix

Compute a confusion matrix for each class or sample.

Notes

When true positive + false positive + false negative == 0 (i.e. a class is completely absent from both y_true and y_pred), f-score is undefined. In such cases, by default f-score will be set to 0.0, and UndefinedMetricWarning will be raised. This behavior can be modified by setting the zero_division parameter.

References

Examples

>>> import numpy as np
>>> from sklearn.metrics import f1_score
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> f1_score(y_true, y_pred, average='macro')
0.26...
>>> f1_score(y_true, y_pred, average='micro')
0.33...
>>> f1_score(y_true, y_pred, average='weighted')
0.26...
>>> f1_score(y_true, y_pred, average=None)
array([0.8, 0. , 0. ])
>>> # binary classification
>>> y_true_empty = [0, 0, 0, 0, 0, 0]
>>> y_pred_empty = [0, 0, 0, 0, 0, 0]
>>> f1_score(y_true_empty, y_pred_empty)
0.0...
>>> f1_score(y_true_empty, y_pred_empty, zero_division=1.0)
1.0...
>>> f1_score(y_true_empty, y_pred_empty, zero_division=np.nan)
nan...
>>> # multilabel classification
>>> y_true = [[0, 0, 0], [1, 1, 1], [0, 1, 1]]
>>> y_pred = [[0, 0, 0], [1, 1, 1], [1, 1, 0]]
>>> f1_score(y_true, y_pred, average=None)
array([0.66666667, 1.        , 0.66666667])
roc_auc_score(y_true, y_score, *, average='macro', sample_weight=None, max_fpr=None, multi_class='raise', labels=None)[source]

Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.

Note: this implementation can be used with binary, multiclass and multilabel classification, but some restrictions apply (see Parameters).

Read more in the User Guide.

Parameters:
  • y_true (array-like of shape (n_samples,) or (n_samples, n_classes)) – True labels or binary label indicators. The binary and multiclass cases expect labels with shape (n_samples,) while the multilabel case expects binary label indicators with shape (n_samples, n_classes).

  • y_score (array-like of shape (n_samples,) or (n_samples, n_classes)) –

    Target scores.

    • In the binary case, it corresponds to an array of shape (n_samples,). Both probability estimates and non-thresholded decision values can be provided. The probability estimates correspond to the probability of the class with the greater label, i.e. estimator.classes_[1] and thus estimator.predict_proba(X)[:, 1]. The decision values correspond to the output of estimator.decision_function(X). See more information in the User guide;

    • In the multiclass case, it corresponds to an array of shape (n_samples, n_classes) of probability estimates provided by the predict_proba method. The probability estimates must sum to 1 across the possible classes. In addition, the order of the class scores must correspond to the order of labels, if provided, or else to the numerical or lexicographical order of the labels in y_true. See more information in the User guide;

    • In the multilabel case, it corresponds to an array of shape (n_samples, n_classes). Probability estimates are provided by the predict_proba method and the non-thresholded decision values by the decision_function method. The probability estimates correspond to the probability of the class with the greater label for each output of the classifier. See more information in the User guide.

  • average ({'micro', 'macro', 'samples', 'weighted'} or None, default='macro') –

    If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data. Note: multiclass ROC AUC currently only handles the ‘macro’ and ‘weighted’ averages. For multiclass targets, average=None is only implemented for multi_class=’ovr’ and average=’micro’ is only implemented for multi_class=’ovr’.

    'micro':

    Calculate metrics globally by considering each element of the label indicator matrix as a label.

    'macro':

    Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

    'weighted':

    Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label).

    'samples':

    Calculate metrics for each instance, and find their average.

    Will be ignored when y_true is binary.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

  • max_fpr (float > 0 and <= 1, default=None) – If not None, the standardized partial AUC [2]_ over the range [0, max_fpr] is returned. For the multiclass case, max_fpr, should be either equal to None or 1.0 as AUC ROC partial computation currently is not supported for multiclass.

  • multi_class ({'raise', 'ovr', 'ovo'}, default='raise') –

    Only used for multiclass targets. Determines the type of configuration to use. The default value raises an error, so either 'ovr' or 'ovo' must be passed explicitly.

    'ovr':

    Stands for One-vs-rest. Computes the AUC of each class against the rest [3]_ [4]_. This treats the multiclass case in the same way as the multilabel case. Sensitive to class imbalance even when average == 'macro', because class imbalance affects the composition of each of the ‘rest’ groupings.

    'ovo':

    Stands for One-vs-one. Computes the average AUC of all possible pairwise combinations of classes [5]. Insensitive to class imbalance when average == 'macro'.

  • labels (array-like of shape (n_classes,), default=None) – Only used for multiclass targets. List of labels that index the classes in y_score. If None, the numerical or lexicographical order of the labels in y_true is used.

Returns:

auc – Area Under the Curve score.

Return type:

float

See also

average_precision_score

Area under the precision-recall curve.

roc_curve

Compute Receiver operating characteristic (ROC) curve.

RocCurveDisplay.from_estimator

Plot Receiver Operating Characteristic (ROC) curve given an estimator and some data.

RocCurveDisplay.from_predictions

Plot Receiver Operating Characteristic (ROC) curve given the true and predicted values.

Notes

The Gini Coefficient is a summary measure of the ranking ability of binary classifiers. It is expressed using the area under the ROC curve as follows:

G = 2 * AUC - 1

Where G is the Gini coefficient and AUC is the ROC-AUC score. This normalisation will ensure that random guessing will yield a score of 0 in expectation, and it is upper bounded by 1.
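A small numeric illustration of this relationship (toy values chosen only for illustration):

>>> from sklearn.metrics import roc_auc_score
>>> y_true = [0, 0, 1, 1]
>>> y_score = [0.1, 0.4, 0.35, 0.8]
>>> auc = roc_auc_score(y_true, y_score)
>>> float(auc)
0.75
>>> float(2 * auc - 1)  # Gini coefficient
0.5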

References

Examples

Binary case:

>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.metrics import roc_auc_score
>>> X, y = load_breast_cancer(return_X_y=True)
>>> clf = LogisticRegression(solver="liblinear", random_state=0).fit(X, y)
>>> roc_auc_score(y, clf.predict_proba(X)[:, 1])
0.99...
>>> roc_auc_score(y, clf.decision_function(X))
0.99...

Multiclass case:

>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> clf = LogisticRegression(solver="liblinear").fit(X, y)
>>> roc_auc_score(y, clf.predict_proba(X), multi_class='ovr')
0.99...

Multilabel case:

>>> import numpy as np
>>> from sklearn.datasets import make_multilabel_classification
>>> from sklearn.multioutput import MultiOutputClassifier
>>> X, y = make_multilabel_classification(random_state=0)
>>> clf = MultiOutputClassifier(clf).fit(X, y)
>>> # get a list of n_output containing probability arrays of shape
>>> # (n_samples, n_classes)
>>> y_pred = clf.predict_proba(X)
>>> # extract the positive columns for each output
>>> y_pred = np.transpose([pred[:, 1] for pred in y_pred])
>>> roc_auc_score(y, y_pred, average=None)
array([0.82..., 0.86..., 0.94..., 0.85... , 0.94...])
>>> from sklearn.linear_model import RidgeClassifierCV
>>> clf = RidgeClassifierCV().fit(X, y)
>>> roc_auc_score(y, clf.decision_function(X), average=None)
array([0.81..., 0.84... , 0.93..., 0.87..., 0.94...])
accuracy_score(y_true, y_pred, *, normalize=True, sample_weight=None)[source]

Accuracy classification score.

In multilabel classification, this function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.

Read more in the User Guide.

Parameters:
  • y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) labels.

  • y_pred (1d array-like, or label indicator array / sparse matrix) – Predicted labels, as returned by a classifier.

  • normalize (bool, default=True) – If False, return the number of correctly classified samples. Otherwise, return the fraction of correctly classified samples.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

score – If normalize == True, return the fraction of correctly classified samples (float), else returns the number of correctly classified samples (int).

The best performance is 1 with normalize == True and the number of samples with normalize == False.

Return type:

float

See also

balanced_accuracy_score

Compute the balanced accuracy to deal with imbalanced datasets.

jaccard_score

Compute the Jaccard similarity coefficient score.

hamming_loss

Compute the average Hamming loss or Hamming distance between two sets of samples.

zero_one_loss

Compute the Zero-one classification loss. By default, the function will return the percentage of imperfectly predicted subsets.

Notes

In binary classification, this function is equal to the jaccard_score function.

Examples

>>> from sklearn.metrics import accuracy_score
>>> y_pred = [0, 2, 1, 3]
>>> y_true = [0, 1, 2, 3]
>>> accuracy_score(y_true, y_pred)
0.5
>>> accuracy_score(y_true, y_pred, normalize=False)
2.0

In the multilabel case with binary label indicators:

>>> import numpy as np
>>> accuracy_score(np.array([[0, 1], [1, 1]]), np.ones((2, 2)))
0.5
balanced_accuracy_score(y_true, y_pred, *, sample_weight=None, adjusted=False)[source]

Compute the balanced accuracy.

The balanced accuracy in binary and multiclass classification problems is used to deal with imbalanced datasets. It is defined as the average of the recall obtained on each class.

The best value is 1 and the worst value is 0 when adjusted=False.

Read more in the User Guide.

New in version 0.20.

Parameters:
  • y_true (array-like of shape (n_samples,)) – Ground truth (correct) target values.

  • y_pred (array-like of shape (n_samples,)) – Estimated targets as returned by a classifier.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

  • adjusted (bool, default=False) – When true, the result is adjusted for chance, so that random performance would score 0, while keeping perfect performance at a score of 1.

Returns:

balanced_accuracy – Balanced accuracy score.

Return type:

float

See also

average_precision_score

Compute average precision (AP) from prediction scores.

precision_score

Compute the precision score.

recall_score

Compute the recall score.

roc_auc_score

Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.

Notes

Some literature promotes alternative definitions of balanced accuracy. Our definition is equivalent to accuracy_score() with class-balanced sample weights, and shares desirable properties with the binary case. See the User Guide.

References

Examples

>>> from sklearn.metrics import balanced_accuracy_score
>>> y_true = [0, 1, 0, 0, 1, 0]
>>> y_pred = [0, 1, 0, 0, 0, 1]
>>> balanced_accuracy_score(y_true, y_pred)
0.625
top_k_accuracy_score(y_true, y_score, *, k=2, normalize=True, sample_weight=None, labels=None)[source]

Top-k Accuracy classification score.

This metric computes the number of times where the correct label is among the top k labels predicted (ranked by predicted scores). Note that the multilabel case isn’t covered here.

Read more in the User Guide

Parameters:
  • y_true (array-like of shape (n_samples,)) – True labels.

  • y_score (array-like of shape (n_samples,) or (n_samples, n_classes)) – Target scores. These can be either probability estimates or non-thresholded decision values (as returned by decision_function on some classifiers). The binary case expects scores with shape (n_samples,) while the multiclass case expects scores with shape (n_samples, n_classes). In the multiclass case, the order of the class scores must correspond to the order of labels, if provided, or else to the numerical or lexicographical order of the labels in y_true. If y_true does not contain all the labels, labels must be provided.

  • k (int, default=2) – Number of most likely outcomes considered to find the correct label.

  • normalize (bool, default=True) – If True, return the fraction of correctly classified samples. Otherwise, return the number of correctly classified samples.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights. If None, all samples are given the same weight.

  • labels (array-like of shape (n_classes,), default=None) – Multiclass only. List of labels that index the classes in y_score. If None, the numerical or lexicographical order of the labels in y_true is used. If y_true does not contain all the labels, labels must be provided.

Returns:

score – The top-k accuracy score. The best performance is 1 with normalize == True and the number of samples with normalize == False.

Return type:

float

See also

accuracy_score

Compute the accuracy score. By default, the function will return the fraction of correct predictions divided by the total number of predictions.

Notes

In cases where two or more labels are assigned equal predicted scores, the labels with the highest indices will be chosen first. This might impact the result if the correct label falls after the threshold because of that.

Examples

>>> import numpy as np
>>> from sklearn.metrics import top_k_accuracy_score
>>> y_true = np.array([0, 1, 2, 2])
>>> y_score = np.array([[0.5, 0.2, 0.2],  # 0 is in top 2
...                     [0.3, 0.4, 0.2],  # 1 is in top 2
...                     [0.2, 0.4, 0.3],  # 2 is in top 2
...                     [0.7, 0.2, 0.1]]) # 2 isn't in top 2
>>> top_k_accuracy_score(y_true, y_score, k=2)
0.75
>>> # Not normalizing gives the number of "correctly" classified samples
>>> top_k_accuracy_score(y_true, y_score, k=2, normalize=False)
3
pearson_r2_score(y: ndarray, y_pred: ndarray) float[source]

Computes Pearson R^2 (square of Pearson correlation).

Parameters:
  • y (np.ndarray) – ground truth array

  • y_pred (np.ndarray) – predicted array

Returns:

The Pearson-R^2 score.

Return type:

float
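A minimal sketch (assumes pearson_r2_score is imported from deepchem.metrics):

>>> import numpy as np
>>> y = np.array([1.0, 2.0, 3.0, 4.0])
>>> y_pred = np.array([1.1, 1.9, 3.2, 3.9])  # nearly linear in y
>>> round(float(pearson_r2_score(y, y_pred)), 2)
0.99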

jaccard_index(y: ndarray, y_pred: ndarray) float[source]

Computes the Jaccard Index (Intersection over Union), a metric commonly used in image segmentation tasks.

DEPRECATED: WILL BE REMOVED IN A FUTURE VERSION OF DEEPCHEM. USE jaccard_score instead.

Parameters:
  • y (np.ndarray) – ground truth array

  • y_pred (np.ndarray) – predicted array

Returns:

score – The jaccard index. A number between 0 and 1.

Return type:

float

pixel_error(y: ndarray, y_pred: ndarray) float[source]

An error metric in case y, y_pred are images.

Defined as 1 - the maximal F-score of pixel similarity, or squared Euclidean distance between the original and the result labels.

Parameters:
  • y (np.ndarray) – ground truth array

  • y_pred (np.ndarray) – predicted array

Returns:

score – The pixel-error. A number between 0 and 1.

Return type:

float

prc_auc_score(y: ndarray, y_pred: ndarray) float[source]

Compute area under precision-recall curve

Parameters:
  • y (np.ndarray) – A numpy array of shape (N, n_classes) or (N,) with true labels

  • y_pred (np.ndarray) – Of shape (N, n_classes) with class probabilities.

Returns:

The area under the precision-recall curve. A number between 0 and 1.

Return type:

float

rms_score(y_true: ndarray, y_pred: ndarray) float[source]

Computes RMS error.

mae_score(y_true: ndarray, y_pred: ndarray) float[source]

Computes MAE.
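A minimal sketch of both scores on the same toy data used in the sklearn examples above (assumes rms_score and mae_score are imported from deepchem.metrics):

>>> import numpy as np
>>> y_true = np.array([3.0, -0.5, 2.0, 7.0])
>>> y_pred = np.array([2.5, 0.0, 2.0, 8.0])
>>> round(float(rms_score(y_true, y_pred)), 3)  # root-mean-squared error
0.612
>>> float(mae_score(y_true, y_pred))  # mean absolute error
0.5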

kappa_score(y1, y2, *, labels=None, weights=None, sample_weight=None)[source]

Compute Cohen’s kappa: a statistic that measures inter-annotator agreement.

This function computes Cohen’s kappa [1]_, a score that expresses the level of agreement between two annotators on a classification problem. It is defined as

\[\kappa = (p_o - p_e) / (1 - p_e)\]

where \(p_o\) is the empirical probability of agreement on the label assigned to any sample (the observed agreement ratio), and \(p_e\) is the expected agreement when both annotators assign labels randomly. \(p_e\) is estimated using a per-annotator empirical prior over the class labels [2]_.

Read more in the User Guide.

Parameters:
  • y1 (array-like of shape (n_samples,)) – Labels assigned by the first annotator.

  • y2 (array-like of shape (n_samples,)) – Labels assigned by the second annotator. The kappa statistic is symmetric, so swapping y1 and y2 doesn’t change the value.

  • labels (array-like of shape (n_classes,), default=None) – List of labels to index the matrix. This may be used to select a subset of labels. If None, all labels that appear at least once in y1 or y2 are used.

  • weights ({'linear', 'quadratic'}, default=None) – Weighting type to calculate the score. None means no weighting; “linear” means linear weighting; “quadratic” means quadratic weighting.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

kappa – The kappa statistic, which is a number between -1 and 1. The maximum value means complete agreement; zero or lower means chance agreement.

Return type:

float

References

  [1] J. Cohen (1960). “A coefficient of agreement for nominal scales”. Educational and Psychological Measurement 20(1):37-46.

  [2] R. Artstein and M. Poesio (2008). “Inter-coder agreement for computational linguistics”. Computational Linguistics 34(4):555-596.

Examples

>>> from sklearn.metrics import cohen_kappa_score
>>> y1 = ["negative", "positive", "negative", "neutral", "positive"]
>>> y2 = ["negative", "positive", "negative", "neutral", "negative"]
>>> cohen_kappa_score(y1, y2)
0.6875
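Working the formula on the example above: the annotators agree on 4 of the 5 samples, so \(p_o = 0.8\). The per-annotator label frequencies are (2/5, 2/5, 1/5) for y1 and (3/5, 1/5, 1/5) for y2 (negative, positive, neutral), giving \(p_e = (2/5)(3/5) + (2/5)(1/5) + (1/5)(1/5) = 0.36\), and hence \(\kappa = (0.8 - 0.36)/(1 - 0.36) = 0.6875\), matching the returned value.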
bedroc_score(y_true: ndarray, y_pred: ndarray, alpha: float = 20.0)[source]

Compute BEDROC metric.

The BEDROC metric, implemented according to Truchon and Bayly, modifies the ROC score to allow for a factor of early recognition. See [1]_ for details.

Parameters:
  • y_true (np.ndarray) – Binary class labels. 1 for positive class, 0 otherwise

  • y_pred (np.ndarray) – Predicted labels

  • alpha (float, default 20.0) – Early recognition parameter

Returns:

Value in [0, 1] that indicates the degree of early recognition

Return type:

float

Notes

This function requires RDKit to be installed.

References

  [1] Truchon, J.-F. and Bayly, C.I. (2007). “Evaluating virtual screening methods: good and bad metrics for the ‘early recognition’ problem”. Journal of Chemical Information and Modeling 47(2):488-508.

concordance_index(y_true: ndarray, y_pred: ndarray) float[source]

Compute Concordance index.

A statistical metric that indicates the quality of the predicted ranking. See [1]_ for details.

Parameters:
  • y_true (np.ndarray) – Continuous ground truth values

  • y_pred (np.ndarray) – Predicted value

Returns:

score – A value in [0, 1]

Return type:

float
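A small illustrative sketch (toy values; the dc.metrics import path is assumed):

>>> import numpy as np
>>> import deepchem as dc
>>> y_true = np.array([1.0, 2.0, 3.0, 4.0])
>>> y_pred = np.array([1.2, 1.9, 3.5, 3.0])
>>> ci = dc.metrics.concordance_index(y_true, y_pred)  # with no ties, 5 of the 6 pairs are ordered consistently, so ci = 5/6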

References

get_motif_scores(encoded_sequences: ndarray, motif_names: List[str], max_scores: int | None = None, return_positions: bool = False, GC_fraction: float = 0.4) ndarray[source]

Computes PWM (position weight matrix) log-odds scores.

Parameters:
  • encoded_sequences (np.ndarray) – A numpy array of shape (N_sequences, N_letters, sequence_length, 1).

  • motif_names (List[str]) – List of motif file names.

  • max_scores (int, optional) – Get top max_scores scores.

  • return_positions (bool, default False) – Whether to return positions or not.

  • GC_fraction (float, default 0.4) – GC fraction in background sequence.

Returns:

A numpy array of scores. By default the shape is (N_sequences, num_motifs, seq_length). If max_scores is set, the shape is (N_sequences, num_motifs*max_scores). If both max_scores and return_positions are set, the array contains the max scores and their positions and has shape (N_sequences, 2*num_motifs*max_scores).

Return type:

np.ndarray

Notes

This method requires simdna to be installed.

get_pssm_scores(encoded_sequences: ndarray, pssm: ndarray) ndarray[source]

Convolves pssm and its reverse complement with encoded sequences and returns the maximum score at each position of each sequence.

Parameters:
  • encoded_sequences (np.ndarray) – A numpy array of shape (N_sequences, N_letters, sequence_length, 1).

  • pssm (np.ndarray) – A numpy array of shape (4, pssm_length).

Returns:

scores – A numpy array of shape (N_sequences, sequence_length).

Return type:

np.ndarray
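A minimal shape-oriented sketch (the one-hot encoding and the toy PSSM below are illustrative only, and the exact import location within deepchem.metrics is an assumption):

>>> import numpy as np
>>> from deepchem.metrics.genomic_metrics import get_pssm_scores
>>> seqs = np.zeros((1, 4, 10, 1))   # one one-hot encoded sequence of length 10
>>> seqs[0, 0, :, 0] = 1.0           # every position is the first letter, purely for illustration
>>> pssm = np.random.rand(4, 3)      # toy position-specific scoring matrix of length 3
>>> scores = get_pssm_scores(seqs, pssm)  # expected shape: (1, 10)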

in_silico_mutagenesis(model: Model, encoded_sequences: ndarray) ndarray[source]

Computes in silico mutagenesis (ISM) scores.

Parameters:
  • model (Model) – This can be any model that accepts inputs of the required shape and produces an output of shape (N_sequences, N_tasks).

  • encoded_sequences (np.ndarray) – A numpy array of shape (N_sequences, N_letters, sequence_length, 1)

Returns:

A numpy array of ISM scores. The shape is (N_tasks, N_sequences, N_letters, sequence_length, 1).

Return type:

np.ndarray

Metric Class

The dc.metrics.Metric class is a wrapper around metric functions which interoperates with DeepChem dc.models.Model.

class Metric(metric: Callable[[...], float], task_averager: Callable[[...], Any] | None = None, name: str | None = None, threshold: float | None = None, mode: str | None = None, n_tasks: int | None = None, classification_handling_mode: str | None = None, threshold_value: float | None = None)[source]

Wrapper class for computing user-defined metrics.

The Metric class provides a wrapper for standardizing the API around different classes of metrics that may be useful for DeepChem models. The implementation provides a few non-standard conveniences such as built-in support for multitask and multiclass metrics.

This class aims to support a variety of metrics. At present, it handles classification and regression metrics that compare scalar values.

At present, this class doesn’t support metric computation on models which don’t present scalar outputs. For example, if you have a generative model which predicts images or molecules, you will need to write a custom evaluation and metric setup.

__init__(metric: Callable[[...], float], task_averager: Callable[[...], Any] | None = None, name: str | None = None, threshold: float | None = None, mode: str | None = None, n_tasks: int | None = None, classification_handling_mode: str | None = None, threshold_value: float | None = None)[source]
Parameters:
  • metric (function) – Function that takes args y_true, y_pred (in that order) and computes desired score. If sample weights are to be considered, metric may take in an additional keyword argument sample_weight.

  • task_averager (function, default None) – If not None, should be a function that averages metrics across tasks.

  • name (str, default None) – Name of this metric

  • threshold (float, default None (DEPRECATED)) – Used for binary metrics and is the threshold for the positive class.

  • mode (str, default None) – Should usually be “classification” or “regression.”

  • n_tasks (int, default None) – The number of tasks this class is expected to handle.

  • classification_handling_mode (str, default None) –

    DeepChem models by default predict class probabilities for classification problems. This means that for a given singletask prediction, after shape normalization, the DeepChem labels and predictions will be numpy arrays of shape (n_samples, n_tasks, n_classes) holding class probabilities. classification_handling_mode is a string that instructs this class how to transform these probabilities. It can take the following values:

    • “direct”: Pass y_true and y_pred directly into self.metric.

    • “threshold”: Use threshold_predictions to threshold y_true and y_pred, with threshold_value as the desired threshold. This converts them into arrays of shape (n_samples, n_tasks), where each element is a class index.

    • “threshold-one-hot”: Use threshold_predictions to threshold y_true and y_pred using threshold_value, then apply to_one_hot to the output.

    • None: Select a mode automatically based on the metric.

  • threshold_value (float, default None) – If set, and classification_handling_mode is “threshold” or “threshold-one-hot”, apply a thresholding operation to values with this threshold. This option is only sensible on binary classification tasks. For multiclass problems, or if threshold_value is None, argmax() is used to select the highest probability class for each task.
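A minimal construction sketch (the wrapped metric functions are the ones documented above; passing np.mean as the task averager follows the usual DeepChem convention):

>>> import numpy as np
>>> import deepchem as dc
>>> # wrap a classification metric; DeepChem models feed class probabilities through it
>>> clf_metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean, mode="classification")
>>> # wrap a regression metric
>>> reg_metric = dc.metrics.Metric(dc.metrics.mae_score, np.mean, mode="regression")

Metric objects wrapped this way are what DeepChem models typically expect when evaluating on a dataset.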

compute_metric(y_true: ArrayLike, y_pred: ArrayLike, w: ArrayLike | None = None, n_tasks: int | None = None, n_classes: int = 2, per_task_metrics: bool = False, use_sample_weights: bool = False, **kwargs) Any[source]

Compute a performance metric for each task.

Parameters:
  • y_true (ArrayLike) – An ArrayLike containing true values for each task. If a classification metric, must be of shape (N,), (N, n_tasks), or (N, n_tasks, n_classes); when of shape (N, n_tasks), values can be either class labels or probabilities of the positive class for binary classification problems. If a regression metric, must be of shape (N,), (N, n_tasks), or (N, n_tasks, 1).

  • y_pred (ArrayLike) – An ArrayLike containing predicted values for each task. Must be of shape (N, n_tasks, n_classes) if a classification metric, else must be of shape (N, n_tasks) if a regression metric.

  • w (ArrayLike, default None) – An ArrayLike containing weights for each datapoint. If specified, must be of shape (N, n_tasks).

  • n_tasks (int, default None) – The number of tasks this class is expected to handle.

  • n_classes (int, default 2) – Number of classes in data for classification tasks.

  • per_task_metrics (bool, default False) – If true, return computed metric for each task on multitask dataset.

  • use_sample_weights (bool, default False) – If set, use per-sample weights w.

  • kwargs (dict) – Will be passed on to self.metric

Returns:

A numpy array containing metric values for each task.

Return type:

np.ndarray
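A rough sketch of a direct multitask call on synthetic regression data (values are illustrative only):

>>> import numpy as np
>>> import deepchem as dc
>>> rng = np.random.default_rng(0)
>>> metric = dc.metrics.Metric(dc.metrics.mae_score, np.mean, mode="regression")
>>> y_true = rng.random((10, 3))                  # 10 samples, 3 regression tasks
>>> y_pred = y_true + 0.05 * rng.random((10, 3))  # predictions close to the labels
>>> score = metric.compute_metric(y_true, y_pred, n_tasks=3)  # task-averaged MAE, small here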

compute_singletask_metric(y_true: ArrayLike, y_pred: ArrayLike, w: ArrayLike | None = None, n_samples: int | None = None, use_sample_weights: bool = False, **kwargs) float[source]

Compute a metric value.

Parameters:
  • y_true (ArrayLike) – True values array. This array must be of shape (N, n_classes) if classification and (N,) if regression.

  • y_pred (ArrayLike) – Predictions array. This array must be of shape (N, n_classes) if classification and (N,) if regression.

  • w (ArrayLike, default None) – Sample weight array. This array must be of shape (N,)

  • n_samples (int, default None (DEPRECATED)) – The number of samples in the dataset. This is N. This argument is ignored.

  • use_sample_weights (bool, default False) – If set, use per-sample weights w.

  • kwargs (dict) – Will be passed on to self.metric

Returns:

metric_value – The computed value of the metric.

Return type:

float
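A minimal singletask sketch (toy regression values; shapes follow the parameter descriptions above):

>>> import numpy as np
>>> import deepchem as dc
>>> metric = dc.metrics.Metric(dc.metrics.pearson_r2_score, mode="regression")
>>> y_true = np.array([1.0, 2.0, 3.0, 4.0])
>>> y_pred = np.array([1.1, 2.2, 2.9, 4.1])
>>> value = metric.compute_singletask_metric(y_true, y_pred)  # Pearson R^2 of the predictions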