Metrics¶
Metrics are one of the most important parts of machine learning. Unlike traditional software, in which algorithms either work or don’t work, machine learning models work in degrees. That is, there’s a continuous range of “goodness” for a model. “Metrics” are functions which measure how well a model works. There are many different choices of metrics depending on the type of model at hand.
Metric Utilities¶
Metric utility functions allow for some common manipulations such as switching to/from one-hot representations.
- to_one_hot(y: numpy.ndarray, n_classes: int = 2) numpy.ndarray [source]¶
Transforms label vector into one-hot encoding.
Turns y into vector of shape (N, n_classes) with a one-hot encoding. Assumes that y takes values from 0 to n_classes - 1.
- Parameters
y (np.ndarray) – A vector of shape (N,) or (N, 1)
n_classes (int, default 2) – If specified use this as the number of classes. Else will try to impute it as n_classes = max(y) + 1 for arrays and as n_classes=2 for the case of scalars. Note this parameter only has value if mode==”classification”
- Returns
A numpy array of shape (N, n_classes).
- Return type
np.ndarray
- from_one_hot(y: numpy.ndarray, axis: int = 1) numpy.ndarray [source]¶
Transforms label vector from one-hot encoding.
- Parameters
y (np.ndarray) – A vector of shape (n_samples, num_classes)
axis (int, optional (default 1)) – The axis with one-hot encodings to reduce on.
- Returns
A numpy array of shape (n_samples,)
- Return type
np.ndarray
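The round trip between integer labels and one-hot rows can be sketched in plain NumPy. This is only an illustration of the transformation described above: `one_hot_sketch` is a hypothetical helper, not DeepChem's implementation, and it assumes labels are integers in `0 .. n_classes - 1`.

```python
import numpy as np

def one_hot_sketch(y, n_classes=2):
    """Illustrative one-hot encoder (hypothetical, not DeepChem's code)."""
    y = np.asarray(y).reshape(-1).astype(int)   # accept shape (N,) or (N, 1)
    out = np.zeros((y.shape[0], n_classes))
    out[np.arange(y.shape[0]), y] = 1.0         # set one column per row
    return out

labels = np.array([0, 2, 1, 2])
encoded = one_hot_sketch(labels, n_classes=3)   # shape (4, 3)
decoded = np.argmax(encoded, axis=1)            # back to shape (4,)
```

Decoding with `argmax` along axis 1 corresponds to the default `axis=1` documented for `from_one_hot`.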
Metric Shape Handling¶
One of the trickiest parts of handling metrics correctly is making sure the shapes of input weights, predictions and labels are processed correctly. This is challenging in particular since DeepChem supports multitask, multiclass models, which means that shapes must be handled with care to prevent errors. DeepChem maintains the following utility functions which attempt to facilitate shape handling for you.
- normalize_weight_shape(w: Optional[numpy.ndarray], n_samples: int, n_tasks: int) numpy.ndarray [source]¶
A utility function to correct the shape of the weight array.
This utility function is used to normalize the shapes of a given weight array.
- Parameters
w (np.ndarray) – w can be None or a scalar or a np.ndarray of shape (n_samples,) or of shape (n_samples, n_tasks). If w is a scalar, it’s assumed to be the same weight for all samples/tasks.
n_samples (int) – The number of samples in the dataset. If w is not None, we should have n_samples = w.shape[0] if w is a ndarray
n_tasks (int) – The number of tasks. If w is 2d ndarray, then we should have w.shape[1] == n_tasks.
Examples
>>> import numpy as np
>>> w_out = normalize_weight_shape(None, n_samples=10, n_tasks=1)
>>> (w_out == np.ones((10, 1))).all()
True
- Returns
w_out – Array of shape (n_samples, n_tasks)
- Return type
np.ndarray
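The documented input cases (`None`, a scalar, a 1-D array, a 2-D array) can be pictured with the following sketch. `weight_shape_sketch` is a hypothetical stand-in written for illustration; in particular, repeating a 1-D weight vector across tasks is an assumption about the intended behaviour, not something the text above states.

```python
import numpy as np

def weight_shape_sketch(w, n_samples, n_tasks):
    """Hypothetical helper mirroring the documented weight cases."""
    if w is None:                                 # no weights: all ones
        return np.ones((n_samples, n_tasks))
    w = np.asarray(w, dtype=float)
    if w.ndim == 0:                               # scalar: same weight everywhere
        return np.full((n_samples, n_tasks), float(w))
    if w.ndim == 1:                               # (n_samples,): assumed repeated per task
        if w.shape[0] != n_samples:
            raise ValueError("w.shape[0] must equal n_samples")
        return np.tile(w[:, None], (1, n_tasks))
    if w.shape != (n_samples, n_tasks):           # 2-D input must already match
        raise ValueError("w must have shape (n_samples, n_tasks)")
    return w

w_none = weight_shape_sketch(None, n_samples=10, n_tasks=1)
w_scalar = weight_shape_sketch(0.5, n_samples=3, n_tasks=2)
w_vector = weight_shape_sketch([1, 2, 3], n_samples=3, n_tasks=2)
```

Every branch returns an array of shape `(n_samples, n_tasks)`, which is the output shape documented above.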
- normalize_labels_shape(y: numpy.ndarray, mode: Optional[str] = None, n_tasks: Optional[int] = None, n_classes: Optional[int] = None) numpy.ndarray [source]¶
A utility function to correct the shape of the labels.
- Parameters
y (np.ndarray) – y is an array of shape (N,) or (N, n_tasks) or (N, n_tasks, 1).
mode (str, default None) – If mode is “classification” or “regression”, attempts to apply data transformations.
n_tasks (int, default None) – The number of tasks this class is expected to handle.
n_classes (int, default None) – If specified use this as the number of classes. Else will try to impute it as n_classes = max(y) + 1 for arrays and as n_classes=2 for the case of scalars. Note this parameter only has value if mode==”classification”
- Returns
y_out – If mode==”classification”, y_out is an array of shape (N, n_tasks, n_classes). If mode==”regression”, y_out is an array of shape (N, n_tasks).
- Return type
np.ndarray
- normalize_prediction_shape(y: numpy.ndarray, mode: Optional[str] = None, n_tasks: Optional[int] = None, n_classes: Optional[int] = None)[source]¶
A utility function to correct the shape of provided predictions.
The metric computation classes expect that inputs for classification have the uniform shape (N, n_tasks, n_classes) and inputs for regression have the uniform shape (N, n_tasks). This function normalizes the provided input array to have the desired shape.
Examples
>>> import numpy as np
>>> y = np.random.rand(10)
>>> y_out = normalize_prediction_shape(y, "regression", n_tasks=1)
>>> y_out.shape
(10, 1)
- Parameters
y (np.ndarray) – If mode==”classification”, y is an array of shape (N,) or (N, n_tasks) or (N, n_tasks, n_classes). If mode==”regression”, y is an array of shape (N,) or (N, n_tasks) or (N, n_tasks, 1).
mode (str, default None) – If mode is “classification” or “regression”, attempts to apply data transformations.
n_tasks (int, default None) – The number of tasks this class is expected to handle.
n_classes (int, default None) – If specified use this as the number of classes. Else will try to impute it as n_classes = max(y) + 1 for arrays and as n_classes=2 for the case of scalars. Note this parameter only has value if mode==”classification”
- Returns
y_out – If mode==”classification”, y_out is an array of shape (N, n_tasks, n_classes). If mode==”regression”, y_out is an array of shape (N, n_tasks).
- Return type
np.ndarray
- handle_classification_mode(y: numpy.ndarray, classification_handling_mode: Optional[str], threshold_value: Optional[float] = None) numpy.ndarray [source]¶
Handle classification mode.
Transform predictions so that they have the correct classification mode.
- Parameters
y (np.ndarray) – Must be of shape (N, n_tasks, n_classes)
classification_handling_mode (str, default None) –
DeepChem models by default predict class probabilities for classification problems. This means that for a given singletask prediction, after shape normalization, the DeepChem prediction will be a numpy array of shape (N, n_classes) with class probabilities. classification_handling_mode is a string that instructs this method how to handle transforming these probabilities. It can take on the following values:
- None: default value. Pass y_pred directly into self.metric.
- “threshold”: Use threshold_predictions to threshold y_pred. Use threshold_value as the desired threshold.
- “threshold-one-hot”: Use threshold_predictions to threshold y_pred using threshold_value, then apply to_one_hot to the output.
threshold_value (float, default None) – If set, and classification_handling_mode is “threshold” or “threshold-one-hot”, apply a thresholding operation to values with this threshold. This option is only sensible on binary classification tasks. If float, this will be applied as a binary classification value.
- Returns
y_out – If classification_handling_mode is “direct”, then of shape (N, n_tasks, n_classes). If classification_handling_mode is “threshold”, then of shape (N, n_tasks). If classification_handling_mode is “threshold-one-hot”, then of shape (N, n_tasks, n_classes).
- Return type
np.ndarray
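The shape changes described for the “threshold” and “threshold-one-hot” modes can be illustrated on a small binary example. This is a sketch in plain NumPy rather than a call into DeepChem; the threshold of 0.6 and the use of `>=` on the class-1 probability are assumptions made for the illustration.

```python
import numpy as np

# Class probabilities of shape (N=3, n_tasks=1, n_classes=2).
probs = np.array([[[0.90, 0.10]],
                  [[0.30, 0.70]],
                  [[0.45, 0.55]]])
threshold = 0.6  # illustrative value

# "threshold"-style output: hard 0/1 labels of shape (N, n_tasks).
hard = (probs[..., 1] >= threshold).astype(int)

# "threshold-one-hot"-style output: back to shape (N, n_tasks, n_classes).
one_hot = np.eye(2)[hard]
```

Note that the third sample would be labelled positive by an `argmax` but is negative under a 0.6 threshold, which is why the threshold only makes sense for binary tasks.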
Metric Functions¶
DeepChem has a variety of different metrics which are useful for measuring model performance. A number (but not all) of these metrics are directly sourced from sklearn.
- matthews_corrcoef(y_true, y_pred, *, sample_weight=None)[source]¶
Compute the Matthews correlation coefficient (MCC).
The Matthews correlation coefficient is used in machine learning as a measure of the quality of binary and multiclass classifications. It takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes. The MCC is in essence a correlation coefficient value between -1 and +1. A coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction. The statistic is also known as the phi coefficient. [source: Wikipedia]
Binary and multiclass labels are supported. Only in the binary case does this relate to information about true and false positives and negatives. See references below.
Read more in the User Guide.
- Parameters
y_true (array, shape = [n_samples]) – Ground truth (correct) target values.
y_pred (array, shape = [n_samples]) – Estimated targets as returned by a classifier.
sample_weight (array-like of shape (n_samples,), default=None) –
Sample weights.
New in version 0.18.
- Returns
mcc – The Matthews correlation coefficient (+1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction).
- Return type
float
References
- 1
- 2
- 3
Gorodkin, (2004). Comparing two K-category assignments by a K-category correlation coefficient.
- 4
Examples
>>> from sklearn.metrics import matthews_corrcoef
>>> y_true = [+1, +1, +1, -1]
>>> y_pred = [+1, -1, +1, +1]
>>> matthews_corrcoef(y_true, y_pred)
-0.33...
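For the binary case, the coefficient can be reproduced by hand from the four confusion-matrix counts using the standard formula MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The helper below, `binary_mcc`, is a hypothetical pure-Python sketch of that formula, not the sklearn implementation; returning 0 when the denominator vanishes is an assumption of the sketch.

```python
import math

def binary_mcc(y_true, y_pred, positive=1):
    """Binary MCC from confusion counts (illustrative sketch)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Same data as the example above: TP=2, TN=0, FP=1, FN=1.
mcc = binary_mcc([+1, +1, +1, -1], [+1, -1, +1, +1])   # (0 - 1) / 3
```

The result, -1/3, agrees with the `-0.33...` shown in the doctest.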
- recall_score(y_true, y_pred, *, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')[source]¶
Compute the recall.
The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
The best value is 1 and the worst value is 0.
Read more in the User Guide.
- Parameters
y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) target values.
y_pred (1d array-like, or label indicator array / sparse matrix) – Estimated targets as returned by a classifier.
labels (array-like, default=None) –
The set of labels to include when average != 'binary', and their order if average is None. Labels present in the data can be excluded, for example to calculate a multiclass average ignoring a majority negative class, while labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in y_true and y_pred are used in sorted order.
Changed in version 0.17: Parameter labels improved for multiclass problem.
pos_label (str or int, default=1) – The class to report if average='binary' and the data is binary. If the data are multiclass or multilabel, this will be ignored; setting labels=[pos_label] and average != 'binary' will report scores for that label only.
average ({'micro', 'macro', 'samples', 'weighted', 'binary'} or None, default='binary') –
This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:
- 'binary': Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.
- 'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
- 'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
- 'weighted': Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall. Weighted recall is equal to accuracy.
- 'samples': Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from accuracy_score()).
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
zero_division ("warn", 0 or 1, default="warn") – Sets the value to return when there is a zero division. If set to “warn”, this acts as 0, but warnings are also raised.
- Returns
recall – Recall of the positive class in binary classification or weighted average of the recall of each class for the multiclass task.
- Return type
float (if average is not None) or array of float of shape (n_unique_labels,)
See also
precision_recall_fscore_support
Compute precision, recall, F-measure and support for each class.
precision_score
Compute the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives.
balanced_accuracy_score
Compute balanced accuracy to deal with imbalanced datasets.
multilabel_confusion_matrix
Compute a confusion matrix for each class or sample.
PrecisionRecallDisplay.from_estimator
Plot precision-recall curve given an estimator and some data.
PrecisionRecallDisplay.from_predictions
Plot precision-recall curve given binary class predictions.
Notes
When true positive + false negative == 0, recall returns 0 and raises UndefinedMetricWarning. This behavior can be modified with zero_division.
Examples
>>> from sklearn.metrics import recall_score
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> recall_score(y_true, y_pred, average='macro')
0.33...
>>> recall_score(y_true, y_pred, average='micro')
0.33...
>>> recall_score(y_true, y_pred, average='weighted')
0.33...
>>> recall_score(y_true, y_pred, average=None)
array([1., 0., 0.])
>>> y_true = [0, 0, 0, 0, 0, 0]
>>> recall_score(y_true, y_pred, average=None)
array([0.5, 0. , 0. ])
>>> recall_score(y_true, y_pred, average=None, zero_division=1)
array([0.5, 1. , 1. ])
>>> # multilabel classification
>>> y_true = [[0, 0, 0], [1, 1, 1], [0, 1, 1]]
>>> y_pred = [[0, 0, 0], [1, 1, 1], [1, 1, 0]]
>>> recall_score(y_true, y_pred, average=None)
array([1. , 1. , 0.5])
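To see where the macro and micro figures in the first example come from, the per-class counts can be tallied by hand. The sketch below is pure Python and does not call sklearn; `per_class_recall` is a hypothetical helper, and using 0 for an undefined ratio is an assumption matching the default described under zero_division.

```python
def per_class_recall(y_true, y_pred, zero_division=0.0):
    """Recall tp / (tp + fn) for each label, from raw counts (sketch)."""
    recalls = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        recalls[label] = tp / (tp + fn) if (tp + fn) else zero_division
    return recalls

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

recalls = per_class_recall(y_true, y_pred)       # class 0 is fully recovered
macro = sum(recalls.values()) / len(recalls)     # unweighted mean over labels
# Micro averaging pools the counts. With one label per sample, every sample
# is a tp or an fn for its true class, so the pooled ratio equals accuracy.
micro = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Both averages come out to 1/3 here because the three classes have equal support; with imbalanced classes the macro and micro values generally differ.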
- r2_score(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average', force_finite=True)[source]¶
\(R^2\) (coefficient of determination) regression score function.
Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). In the general case when the true y is non-constant, a constant model that always predicts the average y disregarding the input features would get a \(R^2\) score of 0.0.
In the particular case when y_true is constant, the \(R^2\) score is not finite: it is either NaN (perfect predictions) or -Inf (imperfect predictions). To prevent such non-finite numbers from polluting higher-level experiments such as a grid search cross-validation, by default these cases are replaced with 1.0 (perfect predictions) or 0.0 (imperfect predictions) respectively. You can set force_finite to False to prevent this fix from happening.
Note: when the prediction residuals have zero mean, the \(R^2\) score is identical to the Explained Variance score.
Read more in the User Guide.
- Parameters
y_true (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Ground truth (correct) target values.
y_pred (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Estimated target values.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
multioutput ({'raw_values', 'uniform_average', 'variance_weighted'}, array-like of shape (n_outputs,) or None, default='uniform_average') –
Defines aggregating of multiple output scores. Array-like value defines weights used to average scores. Default is “uniform_average”.
- ’raw_values’ :
Returns a full set of scores in case of multioutput input.
- ’uniform_average’ :
Scores of all outputs are averaged with uniform weight.
- ’variance_weighted’ :
Scores of all outputs are averaged, weighted by the variances of each individual output.
Changed in version 0.19: Default value of multioutput is ‘uniform_average’.
force_finite (bool, default=True) –
Flag indicating if NaN and -Inf scores resulting from constant data should be replaced with real numbers (1.0 if prediction is perfect, 0.0 otherwise). Default is True, a convenient setting for hyperparameters’ search procedures (e.g. grid search cross-validation).
New in version 1.1.
- Returns
z – The \(R^2\) score or ndarray of scores if ‘multioutput’ is ‘raw_values’.
- Return type
float or ndarray of floats
Notes
This is not a symmetric function.
Unlike most other scores, \(R^2\) score may be negative (it need not actually be the square of a quantity R).
This metric is not well-defined for single samples and will return a NaN value if n_samples is less than two.
Examples
>>> from sklearn.metrics import r2_score
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> r2_score(y_true, y_pred)
0.948...
>>> y_true = [[0.5, 1], [-1, 1], [7, -6]]
>>> y_pred = [[0, 2], [-1, 2], [8, -5]]
>>> r2_score(y_true, y_pred,
...          multioutput='variance_weighted')
0.938...
>>> y_true = [1, 2, 3]
>>> y_pred = [1, 2, 3]
>>> r2_score(y_true, y_pred)
1.0
>>> y_true = [1, 2, 3]
>>> y_pred = [2, 2, 2]
>>> r2_score(y_true, y_pred)
0.0
>>> y_true = [1, 2, 3]
>>> y_pred = [3, 2, 1]
>>> r2_score(y_true, y_pred)
-3.0
>>> y_true = [-2, -2, -2]
>>> y_pred = [-2, -2, -2]
>>> r2_score(y_true, y_pred)
1.0
>>> r2_score(y_true, y_pred, force_finite=False)
nan
>>> y_true = [-2, -2, -2]
>>> y_pred = [-2, -2, -2 + 1e-8]
>>> r2_score(y_true, y_pred)
0.0
>>> r2_score(y_true, y_pred, force_finite=False)
-inf
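The single-output values in these examples follow from the usual definition \(R^2 = 1 - SS_{res} / SS_{tot}\), where \(SS_{res}\) is the sum of squared residuals and \(SS_{tot}\) is the sum of squared deviations of y_true from its mean. A minimal pure-Python sketch of that definition (the name `r2_sketch` is hypothetical, and the constant-target special cases handled by force_finite are deliberately left out):

```python
def r2_sketch(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot for one output (no force_finite handling)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot   # divides by zero if y_true is constant

good_fit = r2_sketch([3, -0.5, 2, 7], [2.5, 0.0, 2, 8])   # 1 - 1.5 / 29.1875
mean_only = r2_sketch([1, 2, 3], [2, 2, 2])               # predicts the mean
reversed_fit = r2_sketch([1, 2, 3], [3, 2, 1])            # 1 - 8 / 2
```

The last case shows how the score becomes negative: the residual sum of squares (8) is four times the total sum of squares (2).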
- mean_squared_error(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average', squared=True)[source]¶
Mean squared error regression loss.
Read more in the User Guide.
- Parameters
y_true (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Ground truth (correct) target values.
y_pred (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Estimated target values.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
multioutput ({'raw_values', 'uniform_average'} or array-like of shape (n_outputs,), default='uniform_average') –
Defines aggregating of multiple output values. Array-like value defines weights used to average errors.
- ’raw_values’ :
Returns a full set of errors in case of multioutput input.
- ’uniform_average’ :
Errors of all outputs are averaged with uniform weight.
squared (bool, default=True) – If True returns MSE value, if False returns RMSE value.
- Returns
loss – A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.
- Return type
float or ndarray of floats
Examples
>>> from sklearn.metrics import mean_squared_error
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> mean_squared_error(y_true, y_pred)
0.375
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> mean_squared_error(y_true, y_pred, squared=False)
0.612...
>>> y_true = [[0.5, 1],[-1, 1],[7, -6]]
>>> y_pred = [[0, 2],[-1, 2],[8, -5]]
>>> mean_squared_error(y_true, y_pred)
0.708...
>>> mean_squared_error(y_true, y_pred, squared=False)
0.822...
>>> mean_squared_error(y_true, y_pred, multioutput='raw_values')
array([0.41666667, 1.        ])
>>> mean_squared_error(y_true, y_pred, multioutput=[0.3, 0.7])
0.825...
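The relationship between the squared=True and squared=False results in the first example is simply a square root. The following pure-Python sketch recomputes both by hand; `mse_sketch` is a hypothetical helper covering only the single-output, unweighted case.

```python
import math

def mse_sketch(y_true, y_pred):
    """Mean of squared residuals for a single output (illustrative)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

mse = mse_sketch(y_true, y_pred)   # (0.25 + 0.25 + 0 + 1) / 4
rmse = math.sqrt(mse)              # what squared=False reports
```

RMSE is in the same units as the target, which is often the reason to prefer it when reporting regression error.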
- mean_absolute_error(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average')[source]¶
Mean absolute error regression loss.
Read more in the User Guide.
- Parameters
y_true (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Ground truth (correct) target values.
y_pred (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Estimated target values.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
multioutput ({'raw_values', 'uniform_average'} or array-like of shape (n_outputs,), default='uniform_average') –
Defines aggregating of multiple output values. Array-like value defines weights used to average errors.
- ’raw_values’ :
Returns a full set of errors in case of multioutput input.
- ’uniform_average’ :
Errors of all outputs are averaged with uniform weight.
- Returns
loss – If multioutput is ‘raw_values’, then mean absolute error is returned for each output separately. If multioutput is ‘uniform_average’ or an ndarray of weights, then the weighted average of all output errors is returned.
MAE output is non-negative floating point. The best value is 0.0.
- Return type
float or ndarray of floats
Examples
>>> from sklearn.metrics import mean_absolute_error
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> mean_absolute_error(y_true, y_pred)
0.5
>>> y_true = [[0.5, 1], [-1, 1], [7, -6]]
>>> y_pred = [[0, 2], [-1, 2], [8, -5]]
>>> mean_absolute_error(y_true, y_pred)
0.75
>>> mean_absolute_error(y_true, y_pred, multioutput='raw_values')
array([0.5, 1. ])
>>> mean_absolute_error(y_true, y_pred, multioutput=[0.3, 0.7])
0.85...
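The three multioutput settings in the example differ only in how the per-output errors are combined. A pure-Python sketch of that aggregation, using the same two-output data (`mae_per_output` is a hypothetical helper, not part of sklearn):

```python
def mae_per_output(y_true, y_pred):
    """Mean absolute error of each output column (illustrative)."""
    n_samples, n_outputs = len(y_true), len(y_true[0])
    return [
        sum(abs(t[j] - p[j]) for t, p in zip(y_true, y_pred)) / n_samples
        for j in range(n_outputs)
    ]

y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]

raw = mae_per_output(y_true, y_pred)              # like 'raw_values'
uniform = sum(raw) / len(raw)                     # like 'uniform_average'
weights = [0.3, 0.7]
weighted = sum(w * e for w, e in zip(weights, raw))   # array-like weights
```

The weighted result is 0.3 × 0.5 + 0.7 × 1.0, which matches the `0.85...` in the doctest.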
- precision_score(y_true, y_pred, *, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')[source]¶
Compute the precision.
The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
The best value is 1 and the worst value is 0.
Read more in the User Guide.
- Parameters
y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) target values.
y_pred (1d array-like, or label indicator array / sparse matrix) – Estimated targets as returned by a classifier.
labels (array-like, default=None) –
The set of labels to include when average != 'binary', and their order if average is None. Labels present in the data can be excluded, for example to calculate a multiclass average ignoring a majority negative class, while labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in y_true and y_pred are used in sorted order.
Changed in version 0.17: Parameter labels improved for multiclass problem.
pos_label (str or int, default=1) – The class to report if average='binary' and the data is binary. If the data are multiclass or multilabel, this will be ignored; setting labels=[pos_label] and average != 'binary' will report scores for that label only.
average ({'micro', 'macro', 'samples', 'weighted', 'binary'} or None, default='binary') –
This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:
- 'binary': Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.
- 'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
- 'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
- 'weighted': Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.
- 'samples': Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from accuracy_score()).
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
zero_division ("warn", 0 or 1, default="warn") – Sets the value to return when there is a zero division. If set to “warn”, this acts as 0, but warnings are also raised.
- Returns
precision – Precision of the positive class in binary classification or weighted average of the precision of each class for the multiclass task.
- Return type
float (if average is not None) or array of float of shape (n_unique_labels,)
See also
precision_recall_fscore_support
Compute precision, recall, F-measure and support for each class.
recall_score
Compute the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives.
PrecisionRecallDisplay.from_estimator
Plot precision-recall curve given an estimator and some data.
PrecisionRecallDisplay.from_predictions
Plot precision-recall curve given binary class predictions.
multilabel_confusion_matrix
Compute a confusion matrix for each class or sample.
Notes
When true positive + false positive == 0, precision returns 0 and raises UndefinedMetricWarning. This behavior can be modified with zero_division.
Examples
>>> from sklearn.metrics import precision_score
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> precision_score(y_true, y_pred, average='macro')
0.22...
>>> precision_score(y_true, y_pred, average='micro')
0.33...
>>> precision_score(y_true, y_pred, average='weighted')
0.22...
>>> precision_score(y_true, y_pred, average=None)
array([0.66..., 0.        , 0.        ])
>>> y_pred = [0, 0, 0, 0, 0, 0]
>>> precision_score(y_true, y_pred, average=None)
array([0.33..., 0.        , 0.        ])
>>> precision_score(y_true, y_pred, average=None, zero_division=1)
array([0.33..., 1.        , 1.        ])
>>> # multilabel classification
>>> y_true = [[0, 0, 0], [1, 1, 1], [0, 1, 1]]
>>> y_pred = [[0, 0, 0], [1, 1, 1], [1, 1, 0]]
>>> precision_score(y_true, y_pred, average=None)
array([0.5, 1. , 1. ])
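The effect of zero_division in the example, where every prediction is class 0, can be traced with a hand count. In the sketch below, `per_class_precision` is a hypothetical pure-Python helper; classes 1 and 2 are never predicted, so tp + fp is zero for them and the substitute value is returned instead of a ratio.

```python
def per_class_precision(y_true, y_pred, zero_division=0.0):
    """Precision tp / (tp + fp) per label, with a fallback for 0 / 0."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        scores.append(tp / (tp + fp) if (tp + fp) else zero_division)
    return scores

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0]

as_zero = per_class_precision(y_true, y_pred)                    # undefined -> 0
as_one = per_class_precision(y_true, y_pred, zero_division=1.0)  # undefined -> 1
```

Only the undefined entries change between the two calls; the class-0 precision of 2/6 is a genuine ratio and is unaffected.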
- precision_recall_curve(y_true, probas_pred, *, pos_label=None, sample_weight=None)[source]¶
Compute precision-recall pairs for different probability thresholds.
Note: this implementation is restricted to the binary classification task.
The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
The last precision and recall values are 1. and 0. respectively and do not have a corresponding threshold. This ensures that the graph starts on the y axis.
The first precision and recall values are precision=class balance and recall=1.0 which corresponds to a classifier that always predicts the positive class.
Read more in the User Guide.
- Parameters
y_true (ndarray of shape (n_samples,)) – True binary labels. If labels are not either {-1, 1} or {0, 1}, then pos_label should be explicitly given.
probas_pred (ndarray of shape (n_samples,)) – Target scores, can either be probability estimates of the positive class, or non-thresholded measure of decisions (as returned by decision_function on some classifiers).
pos_label (int or str, default=None) – The label of the positive class. When pos_label=None, if y_true is in {-1, 1} or {0, 1}, pos_label is set to 1, otherwise an error will be raised.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns
precision (ndarray of shape (n_thresholds + 1,)) – Precision values such that element i is the precision of predictions with score >= thresholds[i] and the last element is 1.
recall (ndarray of shape (n_thresholds + 1,)) – Decreasing recall values such that element i is the recall of predictions with score >= thresholds[i] and the last element is 0.
thresholds (ndarray of shape (n_thresholds,)) – Increasing thresholds on the decision function used to compute precision and recall where n_thresholds = len(np.unique(probas_pred)).
See also
PrecisionRecallDisplay.from_estimator
Plot Precision Recall Curve given a binary classifier.
PrecisionRecallDisplay.from_predictions
Plot Precision Recall Curve using predictions from a binary classifier.
average_precision_score
Compute average precision from prediction scores.
det_curve
Compute error rates for different probability thresholds.
roc_curve
Compute Receiver operating characteristic (ROC) curve.
Examples
>>> import numpy as np
>>> from sklearn.metrics import precision_recall_curve
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> precision, recall, thresholds = precision_recall_curve(
...     y_true, y_scores)
>>> precision
array([0.5       , 0.66666667, 0.5       , 1.        , 1.        ])
>>> recall
array([1. , 1. , 0.5, 0.5, 0. ])
>>> thresholds
array([0.1 , 0.35, 0.4 , 0.8 ])
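The arrays in this example can be reproduced by sweeping each distinct score as a threshold and counting predictions with score >= threshold, then appending the final (precision=1, recall=0) point described above. The sketch below is a simplified pure-Python illustration, not sklearn's algorithm: `pr_curve_sketch` is a hypothetical helper that ignores sample weights and assumes at least one positive label.

```python
def pr_curve_sketch(y_true, scores, pos_label=1):
    """Precision/recall at each distinct score threshold (illustrative)."""
    thresholds = sorted(set(scores))
    n_pos = sum(t == pos_label for t in y_true)
    precision, recall = [], []
    for thr in thresholds:
        predicted = [s >= thr for s in scores]      # positive if score >= thr
        tp = sum(p and t == pos_label for p, t in zip(predicted, y_true))
        fp = sum(p and t != pos_label for p, t in zip(predicted, y_true))
        precision.append(tp / (tp + fp))            # tp + fp >= 1 by construction
        recall.append(tp / n_pos)
    precision.append(1.0)                           # final point, no threshold
    recall.append(0.0)
    return precision, recall, thresholds

precision, recall, thresholds = pr_curve_sketch(
    [0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

At the lowest threshold everything is predicted positive, so precision equals the class balance (0.5) and recall is 1, as stated in the description.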
- auc(x, y)[source]¶
Compute Area Under the Curve (AUC) using the trapezoidal rule.
This is a general function, given points on a curve. For computing the area under the ROC-curve, see roc_auc_score(). For an alternative way to summarize a precision-recall curve, see average_precision_score().
- Parameters
x (ndarray of shape (n,)) – X coordinates. These must be either monotonic increasing or monotonic decreasing.
y (ndarray of shape (n,)) – Y coordinates.
- Returns
auc – Area Under the Curve.
- Return type
float
See also
roc_auc_score
Compute the area under the ROC curve.
average_precision_score
Compute average precision from prediction scores.
precision_recall_curve
Compute precision-recall pairs for different probability thresholds.
Examples
>>> import numpy as np
>>> from sklearn import metrics
>>> y = np.array([1, 1, 2, 2])
>>> pred = np.array([0.1, 0.4, 0.35, 0.8])
>>> fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)
>>> metrics.auc(fpr, tpr)
0.75
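The trapezoidal rule itself is short enough to write out. The sketch below applies it to the ROC points for the data in the example; the `fpr` and `tpr` lists are typed in by hand as the values `roc_curve` is expected to return for this input, which is an assumption of the sketch, and `trapezoid_area` is a hypothetical helper.

```python
def trapezoid_area(x, y):
    """Area under piecewise-linear points via the trapezoidal rule."""
    area = 0.0
    for i in range(1, len(x)):
        area += (x[i] - x[i - 1]) * (y[i] + y[i - 1]) / 2.0
    return abs(area)   # monotonically decreasing x yields a negative sum

# ROC points assumed for y = [1, 1, 2, 2], pred = [0.1, 0.4, 0.35, 0.8].
fpr = [0.0, 0.0, 0.5, 0.5, 1.0]
tpr = [0.0, 0.5, 0.5, 1.0, 1.0]

area = trapezoid_area(fpr, tpr)   # 0.25 + 0.5
```

Vertical segments (where x does not change) contribute nothing, so only the two horizontal steps of the ROC staircase add to the total.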
- jaccard_score(y_true, y_pred, *, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')[source]¶
Jaccard similarity coefficient score.
The Jaccard index [1], or Jaccard similarity coefficient, defined as the size of the intersection divided by the size of the union of two label sets, is used to compare the set of predicted labels for a sample to the corresponding set of labels in y_true.
Read more in the User Guide.
- Parameters
y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) labels.
y_pred (1d array-like, or label indicator array / sparse matrix) – Predicted labels, as returned by a classifier.
labels (array-like of shape (n_classes,), default=None) – The set of labels to include when average != 'binary', and their order if average is None. Labels present in the data can be excluded, for example to calculate a multiclass average ignoring a majority negative class, while labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in y_true and y_pred are used in sorted order.
pos_label (str or int, default=1) – The class to report if average='binary' and the data is binary. If the data are multiclass or multilabel, this will be ignored; setting labels=[pos_label] and average != 'binary' will report scores for that label only.
average ({'micro', 'macro', 'samples', 'weighted', 'binary'} or None, default='binary') –
If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:
- 'binary': Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.
- 'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
- 'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
- 'weighted': Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance.
- 'samples': Calculate metrics for each instance, and find their average (only meaningful for multilabel classification).
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
zero_division ("warn", {0.0, 1.0}, default="warn") – Sets the value to return when there is a zero division, i.e. when there are no negative values in predictions and labels. If set to “warn”, this acts like 0, but a warning is also raised.
- Returns
score – The Jaccard score. When average is not None, a single scalar is returned.
- Return type
float or ndarray of shape (n_unique_labels,), dtype=np.float64
See also
accuracy_score
Function for calculating the accuracy score.
f1_score
Function for calculating the F1 score.
multilabel_confusion_matrix
Function for computing a confusion matrix for each class or sample.
Notes
jaccard_score() may be a poor metric if there are no positives for some samples or classes. Jaccard is undefined if there are no true or predicted labels, and our implementation will return a score of 0 with a warning.
References
Examples
>>> import numpy as np
>>> from sklearn.metrics import jaccard_score
>>> y_true = np.array([[0, 1, 1],
...                    [1, 1, 0]])
>>> y_pred = np.array([[1, 1, 1],
...                    [1, 0, 0]])
In the binary case:
>>> jaccard_score(y_true[0], y_pred[0])
0.6666...
In the 2D comparison case (e.g. image similarity):
>>> jaccard_score(y_true, y_pred, average="micro")
0.6
In the multilabel case:
>>> jaccard_score(y_true, y_pred, average='samples')
0.5833...
>>> jaccard_score(y_true, y_pred, average='macro')
0.6666...
>>> jaccard_score(y_true, y_pred, average=None)
array([0.5, 0.5, 1. ])
In the multiclass case:
>>> y_pred = [0, 2, 1, 2]
>>> y_true = [0, 1, 2, 2]
>>> jaccard_score(y_true, y_pred, average=None)
array([1. , 0. , 0.33...])
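As a rough illustration of what the binary score measures, the sketch below computes the Jaccard coefficient directly as intersection over union with NumPy. The helper name binary_jaccard is hypothetical and is not part of scikit-learn or DeepChem. It ignores sample weights and simply returns 0.0 in the undefined case, without the warning mentioned in the Notes.

```python
import numpy as np


def binary_jaccard(y_true, y_pred):
    """Hypothetical helper: |intersection| / |union| of the positive labels."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    intersection = np.logical_and(y_true, y_pred).sum()  # true positives
    union = np.logical_or(y_true, y_pred).sum()          # TP + FP + FN
    if union == 0:
        return 0.0  # undefined case, treated as 0 in this sketch
    return float(intersection / union)


y_true = np.array([[0, 1, 1], [1, 1, 0]])
y_pred = np.array([[1, 1, 1], [1, 0, 0]])

# Binary case: compare the first row only.
binary_score = binary_jaccard(y_true[0], y_pred[0])
# 'micro' averaging: pool every element of the indicator matrix.
micro_score = binary_jaccard(y_true.ravel(), y_pred.ravel())
# average=None: one score per label (column).
per_label = [binary_jaccard(y_true[:, j], y_pred[:, j]) for j in range(3)]
```

On the arrays from the Examples, these three values correspond to the binary, 'micro' and average=None results shown there.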
- f1_score(y_true, y_pred, *, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')[source]¶
Compute the F1 score, also known as balanced F-score or F-measure.
The F1 score can be interpreted as a harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 and its worst value at 0. Precision and recall contribute equally to the F1 score. The formula for the F1 score is:
F1 = 2 * (precision * recall) / (precision + recall)
In the multi-class and multi-label case, this is the average of the F1 score of each class with weighting depending on the average parameter.
Read more in the User Guide.
- Parameters
y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) target values.
y_pred (1d array-like, or label indicator array / sparse matrix) – Estimated targets as returned by a classifier.
labels (array-like, default=None) – The set of labels to include when average != 'binary', and their order if average is None. Labels present in the data can be excluded, for example to calculate a multiclass average ignoring a majority negative class, while labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in y_true and y_pred are used in sorted order.
Changed in version 0.17: Parameter labels improved for multiclass problem.
pos_label (str or int, default=1) – The class to report if average='binary' and the data is binary. If the data are multiclass or multilabel, this will be ignored; setting labels=[pos_label] and average != 'binary' will report scores for that label only.
average ({'micro', 'macro', 'samples', 'weighted', 'binary'} or None, default='binary') – This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:
'binary': Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.
'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
'weighted': Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.
'samples': Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from accuracy_score()).
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
zero_division ("warn", 0 or 1, default="warn") – Sets the value to return when there is a zero division, i.e. when all predictions and labels are negative. If set to “warn”, this acts as 0, but warnings are also raised.
- Returns
f1_score – F1 score of the positive class in binary classification or weighted average of the F1 scores of each class for the multiclass task.
- Return type
float or array of float, shape = [n_unique_labels]
See also
fbeta_score
Compute the F-beta score.
precision_recall_fscore_support
Compute the precision, recall, F-score, and support.
jaccard_score
Compute the Jaccard similarity coefficient score.
multilabel_confusion_matrix
Compute a confusion matrix for each class or sample.
Notes
When true positive + false positive == 0, precision is undefined. When true positive + false negative == 0, recall is undefined. In such cases, by default the metric will be set to 0, as will f-score, and UndefinedMetricWarning will be raised. This behavior can be modified with zero_division.
Examples
>>> from sklearn.metrics import f1_score
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> f1_score(y_true, y_pred, average='macro')
0.26...
>>> f1_score(y_true, y_pred, average='micro')
0.33...
>>> f1_score(y_true, y_pred, average='weighted')
0.26...
>>> f1_score(y_true, y_pred, average=None)
array([0.8, 0. , 0. ])
>>> y_true = [0, 0, 0, 0, 0, 0]
>>> y_pred = [0, 0, 0, 0, 0, 0]
>>> f1_score(y_true, y_pred, zero_division=1)
1.0...
>>> # multilabel classification
>>> y_true = [[0, 0, 0], [1, 1, 1], [0, 1, 1]]
>>> y_pred = [[0, 0, 0], [1, 1, 1], [1, 1, 0]]
>>> f1_score(y_true, y_pred, average=None)
array([0.66666667, 1. , 0.66666667])
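To make the formula F1 = 2 * (precision * recall) / (precision + recall) concrete, here is a minimal sketch that derives per-class F1 from raw counts and then takes the unweighted ('macro') mean. The helper per_class_f1 is hypothetical and not the scikit-learn implementation. It assumes 1d label arrays, ignores sample weights, and treats every zero division as 0 without raising a warning.

```python
import numpy as np


def per_class_f1(y_true, y_pred, labels):
    """Hypothetical helper: F1 = 2 * P * R / (P + R) for each label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for label in labels:
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        precision = tp / (tp + fp) if tp + fp > 0 else 0.0
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0
        if precision + recall == 0:
            scores.append(0.0)  # zero division treated as 0 in this sketch
        else:
            scores.append(float(2 * precision * recall / (precision + recall)))
    return scores


y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
f1_per_class = per_class_f1(y_true, y_pred, labels=[0, 1, 2])
macro_f1 = sum(f1_per_class) / len(f1_per_class)  # unweighted mean over classes
```

For class 0 the counts are TP=2, FP=1, FN=0, so precision is 2/3, recall is 1 and F1 is 0.8. Classes 1 and 2 have no true positives, so the macro average is 0.8 / 3.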
- roc_auc_score(y_true, y_score, *, average='macro', sample_weight=None, max_fpr=None, multi_class='raise', labels=None)[source]¶
Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.
Note: this implementation can be used with binary, multiclass and multilabel classification, but some restrictions apply (see Parameters).
Read more in the User Guide.
- Parameters
y_true (array-like of shape (n_samples,) or (n_samples, n_classes)) – True labels or binary label indicators. The binary and multiclass cases expect labels with shape (n_samples,) while the multilabel case expects binary label indicators with shape (n_samples, n_classes).
y_score (array-like of shape (n_samples,) or (n_samples, n_classes)) – Target scores.
In the binary case, it corresponds to an array of shape (n_samples,). Both probability estimates and non-thresholded decision values can be provided. The probability estimates correspond to the probability of the class with the greater label, i.e. estimator.classes_[1] and thus estimator.predict_proba(X, y)[:, 1]. The decision values correspond to the output of estimator.decision_function(X, y). See more information in the User guide.
In the multiclass case, it corresponds to an array of shape (n_samples, n_classes) of probability estimates provided by the predict_proba method. The probability estimates must sum to 1 across the possible classes. In addition, the order of the class scores must correspond to the order of labels, if provided, or else to the numerical or lexicographical order of the labels in y_true. See more information in the User guide.
In the multilabel case, it corresponds to an array of shape (n_samples, n_classes). Probability estimates are provided by the predict_proba method and the non-thresholded decision values by the decision_function method. The probability estimates correspond to the probability of the class with the greater label for each output of the classifier. See more information in the User guide.
average ({'micro', 'macro', 'samples', 'weighted'} or None, default='macro') – If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data. Note: multiclass ROC AUC currently only handles the ‘macro’ and ‘weighted’ averages. For multiclass targets, average=None is only implemented for multi_class=’ovo’.
'micro': Calculate metrics globally by considering each element of the label indicator matrix as a label.
'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
'weighted': Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label).
'samples': Calculate metrics for each instance, and find their average.
Will be ignored when y_true is binary.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
max_fpr (float > 0 and <= 1, default=None) – If not None, the standardized partial AUC [2]_ over the range [0, max_fpr] is returned. For the multiclass case, max_fpr should be either equal to None or 1.0, as AUC ROC partial computation currently is not supported for multiclass.
multi_class ({'raise', 'ovr', 'ovo'}, default='raise') – Only used for multiclass targets. Determines the type of configuration to use. The default value raises an error, so either 'ovr' or 'ovo' must be passed explicitly.
'ovr': Stands for One-vs-rest. Computes the AUC of each class against the rest [3]_ [4]_. This treats the multiclass case in the same way as the multilabel case. Sensitive to class imbalance even when average == 'macro', because class imbalance affects the composition of each of the ‘rest’ groupings.
'ovo': Stands for One-vs-one. Computes the average AUC of all possible pairwise combinations of classes [5]_. Insensitive to class imbalance when average == 'macro'.
labels (array-like of shape (n_classes,), default=None) – Only used for multiclass targets. List of labels that index the classes in y_score. If None, the numerical or lexicographical order of the labels in y_true is used.
- Returns
auc – Area Under the Curve score.
- Return type
float
See also
average_precision_score
Area under the precision-recall curve.
roc_curve
Compute Receiver operating characteristic (ROC) curve.
RocCurveDisplay.from_estimator
Plot Receiver Operating Characteristic (ROC) curve given an estimator and some data.
RocCurveDisplay.from_predictions
Plot Receiver Operating Characteristic (ROC) curve given the true and predicted values.
References
- 1
- 2
- 3
Provost, F., Domingos, P. (2000). Well-trained PETs: Improving probability estimation trees (Section 6.2), CeDER Working Paper #IS-00-04, Stern School of Business, New York University.
- 4
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861-874.
- 5
Examples
Binary case:
>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.metrics import roc_auc_score
>>> X, y = load_breast_cancer(return_X_y=True)
>>> clf = LogisticRegression(solver="liblinear", random_state=0).fit(X, y)
>>> roc_auc_score(y, clf.predict_proba(X)[:, 1])
0.99...
>>> roc_auc_score(y, clf.decision_function(X))
0.99...
Multiclass case:
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> clf = LogisticRegression(solver="liblinear").fit(X, y)
>>> roc_auc_score(y, clf.predict_proba(X), multi_class='ovr')
0.99...
Multilabel case:
>>> import numpy as np
>>> from sklearn.datasets import make_multilabel_classification
>>> from sklearn.multioutput import MultiOutputClassifier
>>> X, y = make_multilabel_classification(random_state=0)
>>> clf = MultiOutputClassifier(clf).fit(X, y)
>>> # get a list of n_output containing probability arrays of shape
>>> # (n_samples, n_classes)
>>> y_pred = clf.predict_proba(X)
>>> # extract the positive columns for each output
>>> y_pred = np.transpose([pred[:, 1] for pred in y_pred])
>>> roc_auc_score(y, y_pred, average=None)
array([0.82..., 0.86..., 0.94..., 0.85... , 0.94...])
>>> from sklearn.linear_model import RidgeClassifierCV
>>> clf = RidgeClassifierCV().fit(X, y)
>>> roc_auc_score(y, clf.decision_function(X), average=None)
array([0.81..., 0.84... , 0.93..., 0.87..., 0.94...])
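For intuition about what the binary ROC AUC measures, it can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way with NumPy. The helper pairwise_auc is hypothetical, assumes labels are exactly 0 and 1, ignores sample weights, and uses memory proportional to the number of positive/negative pairs, so it is an illustration and not a replacement for roc_auc_score.

```python
import numpy as np


def pairwise_auc(y_true, y_score):
    """Hypothetical helper: fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)  # ties count as half a win
    return float(wins / (len(pos) * len(neg)))


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, y_score)
```

Here three of the four positive/negative pairs are ordered correctly (only 0.35 versus 0.4 is not), so the result is 0.75.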
- accuracy_score(y_true, y_pred, *, normalize=True, sample_weight=None)[source]¶
Accuracy classification score.
In multilabel classification, this function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.
Read more in the User Guide.
- Parameters
y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) labels.
y_pred (1d array-like, or label indicator array / sparse matrix) – Predicted labels, as returned by a classifier.
normalize (bool, default=True) – If False, return the number of correctly classified samples. Otherwise, return the fraction of correctly classified samples.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns
score – If normalize == True, return the fraction of correctly classified samples (float), else returns the number of correctly classified samples (int).
The best performance is 1 with normalize == True and the number of samples with normalize == False.
- Return type
float
See also
balanced_accuracy_score
Compute the balanced accuracy to deal with imbalanced datasets.
jaccard_score
Compute the Jaccard similarity coefficient score.
hamming_loss
Compute the average Hamming loss or Hamming distance between two sets of samples.
zero_one_loss
Compute the Zero-one classification loss. By default, the function will return the percentage of imperfectly predicted subsets.
Notes
Unlike jaccard_score, accuracy counts true negatives as correct predictions, so the two scores generally differ, even in binary classification.
Examples
>>> from sklearn.metrics import accuracy_score
>>> y_pred = [0, 2, 1, 3]
>>> y_true = [0, 1, 2, 3]
>>> accuracy_score(y_true, y_pred)
0.5
>>> accuracy_score(y_true, y_pred, normalize=False)
2
In the multilabel case with binary label indicators:
>>> import numpy as np
>>> accuracy_score(np.array([[0, 1], [1, 1]]), np.ones((2, 2)))
0.5
- balanced_accuracy_score(y_true, y_pred, *, sample_weight=None, adjusted=False)[source]¶
Compute the balanced accuracy.
Balanced accuracy is used in binary and multiclass classification problems to deal with imbalanced datasets. It is defined as the average of recall obtained on each class.
The best value is 1 and the worst value is 0 when adjusted=False.
Read more in the User Guide.
New in version 0.20.
- Parameters
y_true (1d array-like) – Ground truth (correct) target values.
y_pred (1d array-like) – Estimated targets as returned by a classifier.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
adjusted (bool, default=False) – When true, the result is adjusted for chance, so that random performance would score 0, while keeping perfect performance at a score of 1.
- Returns
balanced_accuracy – Balanced accuracy score.
- Return type
float
See also
average_precision_score
Compute average precision (AP) from prediction scores.
precision_score
Compute the precision score.
recall_score
Compute the recall score.
roc_auc_score
Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.
Notes
Some literature promotes alternative definitions of balanced accuracy. Our definition is equivalent to accuracy_score() with class-balanced sample weights, and shares desirable properties with the binary case. See the User Guide.
References
- 1
Brodersen, K.H.; Ong, C.S.; Stephan, K.E.; Buhmann, J.M. (2010). The balanced accuracy and its posterior distribution. Proceedings of the 20th International Conference on Pattern Recognition, 3121-24.
- 2
John. D. Kelleher, Brian Mac Namee, Aoife D’Arcy, (2015). Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies.
Examples
>>> from sklearn.metrics import balanced_accuracy_score
>>> y_true = [0, 1, 0, 0, 1, 0]
>>> y_pred = [0, 1, 0, 0, 0, 1]
>>> balanced_accuracy_score(y_true, y_pred)
0.625
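Since balanced accuracy is defined above as the average of the recall obtained on each class, it can be sketched in a few lines of NumPy. The helper mean_per_class_recall is hypothetical and ignores sample weights. The chance adjustment shown (subtract 1 / n_classes, then rescale) is my reading of what adjusted=True does and should be treated as an assumption.

```python
import numpy as np


def mean_per_class_recall(y_true, y_pred, adjusted=False):
    """Hypothetical helper: average recall over the classes present in y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_true)
    # Recall of a class = fraction of its true members that were predicted as it.
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    score = float(np.mean(recalls))
    if adjusted:
        chance = 1.0 / len(classes)  # assumed chance level for the adjustment
        score = (score - chance) / (1.0 - chance)
    return score


y_true = [0, 1, 0, 0, 1, 0]
y_pred = [0, 1, 0, 0, 0, 1]
balanced = mean_per_class_recall(y_true, y_pred)
balanced_adjusted = mean_per_class_recall(y_true, y_pred, adjusted=True)
```

Class 0 has recall 3/4 and class 1 has recall 1/2, which averages to 0.625. With the assumed adjustment this becomes (0.625 - 0.5) / 0.5 = 0.25.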
- top_k_accuracy_score(y_true, y_score, *, k=2, normalize=True, sample_weight=None, labels=None)[source]¶
Top-k Accuracy classification score.
This metric computes the number of times where the correct label is among the top k labels predicted (ranked by predicted scores). Note that the multilabel case isn’t covered here.
Read more in the User Guide
- Parameters
y_true (array-like of shape (n_samples,)) – True labels.
y_score (array-like of shape (n_samples,) or (n_samples, n_classes)) – Target scores. These can be either probability estimates or non-thresholded decision values (as returned by decision_function on some classifiers). The binary case expects scores with shape (n_samples,) while the multiclass case expects scores with shape (n_samples, n_classes). In the multiclass case, the order of the class scores must correspond to the order of labels, if provided, or else to the numerical or lexicographical order of the labels in y_true. If y_true does not contain all the labels, labels must be provided.
k (int, default=2) – Number of most likely outcomes considered to find the correct label.
normalize (bool, default=True) – If True, return the fraction of correctly classified samples. Otherwise, return the number of correctly classified samples.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights. If None, all samples are given the same weight.
labels (array-like of shape (n_classes,), default=None) – Multiclass only. List of labels that index the classes in y_score. If None, the numerical or lexicographical order of the labels in y_true is used. If y_true does not contain all the labels, labels must be provided.
- Returns
score – The top-k accuracy score. The best performance is 1 with normalize == True and the number of samples with normalize == False.
- Return type
float
Notes
In cases where two or more labels are assigned equal predicted scores, the labels with the highest indices will be chosen first. This might impact the result if the correct label falls after the threshold because of that.
Examples
>>> import numpy as np
>>> from sklearn.metrics import top_k_accuracy_score
>>> y_true = np.array([0, 1, 2, 2])
>>> y_score = np.array([[0.5, 0.2, 0.2],  # 0 is in top 2
...                     [0.3, 0.4, 0.2],  # 1 is in top 2
...                     [0.2, 0.4, 0.3],  # 2 is in top 2
...                     [0.7, 0.2, 0.1]]) # 2 isn't in top 2
>>> top_k_accuracy_score(y_true, y_score, k=2)
0.75
>>> # Not normalizing gives the number of "correctly" classified samples
>>> top_k_accuracy_score(y_true, y_score, k=2, normalize=False)
3
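The ranking step behind this metric can be sketched with NumPy as follows. The helper top_k_hits is hypothetical. It assumes the multiclass case with labels 0 .. n_classes - 1 that index the columns of y_score directly, and it ignores sample weights. A stable sort is used so that, among tied scores, the higher column index is preferred, in line with the tie-breaking behaviour described in the Notes.

```python
import numpy as np


def top_k_hits(y_true, y_score, k=2):
    """Hypothetical helper: True where the correct label is among the k best scores."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    # Ascending stable sort; the last k columns hold the k highest-scoring labels.
    top_k = np.argsort(y_score, axis=1, kind="stable")[:, -k:]
    return np.any(top_k == y_true[:, None], axis=1)


y_true = np.array([0, 1, 2, 2])
y_score = np.array([[0.5, 0.2, 0.2],
                    [0.3, 0.4, 0.2],
                    [0.2, 0.4, 0.3],
                    [0.7, 0.2, 0.1]])

hits = top_k_hits(y_true, y_score, k=2)
fraction = float(hits.mean())  # corresponds to normalize=True
count = int(hits.sum())        # corresponds to normalize=False
```

The first three samples have their true label in the top two, while the last one does not, giving a fraction of 0.75 and a count of 3.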
- pearson_r2_score(y: numpy.ndarray, y_pred: numpy.ndarray) float [source]¶
Computes Pearson R^2 (square of Pearson correlation).
- Parameters
y (np.ndarray) – ground truth array
y_pred (np.ndarray) – predicted array
- Returns
The Pearson-R^2 score.
- Return type
float
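A minimal sketch of the quantity described here, assuming it is simply the square of the Pearson correlation coefficient as the summary line states. The helper pearson_r2 is hypothetical and is not DeepChem's implementation.

```python
import numpy as np


def pearson_r2(y, y_pred):
    """Hypothetical helper: squared Pearson correlation between y and y_pred."""
    r = np.corrcoef(np.ravel(y), np.ravel(y_pred))[0, 1]
    return float(r ** 2)


y = np.array([1.0, 2.0, 3.0, 4.0])
shifted = pearson_r2(y, 2 * y + 1)  # predictions off by a scale and an offset
reversed_sign = pearson_r2(y, -y)   # predictions perfectly anti-correlated
```

Both calls give a value of 1 up to floating-point error. The squared correlation only measures linear association, so it does not penalize a systematic scale, offset, or sign error in the predictions. This is worth keeping in mind when comparing it with error-based regression metrics.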
- jaccard_index(y: numpy.ndarray, y_pred: numpy.ndarray) float [source]¶
Computes the Jaccard Index, also known as Intersection over Union, a metric commonly used in image segmentation tasks.
DEPRECATED: WILL BE REMOVED IN A FUTURE VERSION OF DEEPCHEM. USE jaccard_score instead.
- Parameters
y (np.ndarray) – ground truth array
y_pred (np.ndarray) – predicted array
- Returns
score – The jaccard index. A number between 0 and 1.
- Return type
float
- pixel_error(y: numpy.ndarray, y_pred: numpy.ndarray) float [source]¶
An error metric in case y, y_pred are images.
Defined as 1 - the maximal F-score of pixel similarity, or squared Euclidean distance between the original and the result labels.
- Parameters
y (np.ndarray) – ground truth array
y_pred (np.ndarray) – predicted array
- Returns
score – The pixel-error. A number between 0 and 1.
- Return type
float
- prc_auc_score(y: numpy.ndarray, y_pred: numpy.ndarray) float [source]¶
Compute area under precision-recall curve
- Parameters
y (np.ndarray) – A numpy array of shape (N, n_classes) or (N,) with true labels
y_pred (np.ndarray) – Of shape (N, n_classes) with class probabilities.
- Returns
The area under the precision-recall curve. A number between 0 and 1.
- Return type
float
- kappa_score(y1, y2, *, labels=None, weights=None, sample_weight=None)[source]¶
Compute Cohen’s kappa: a statistic that measures inter-annotator agreement.
This function computes Cohen’s kappa [1]_, a score that expresses the level of agreement between two annotators on a classification problem. It is defined as
\[\kappa = (p_o - p_e) / (1 - p_e)\]
where \(p_o\) is the empirical probability of agreement on the label assigned to any sample (the observed agreement ratio), and \(p_e\) is the expected agreement when both annotators assign labels randomly. \(p_e\) is estimated using a per-annotator empirical prior over the class labels [2]_.
Read more in the User Guide.
- Parameters
y1 (array of shape (n_samples,)) – Labels assigned by the first annotator.
y2 (array of shape (n_samples,)) – Labels assigned by the second annotator. The kappa statistic is symmetric, so swapping y1 and y2 doesn’t change the value.
labels (array-like of shape (n_classes,), default=None) – List of labels to index the matrix. This may be used to select a subset of labels. If None, all labels that appear at least once in y1 or y2 are used.
weights ({'linear', 'quadratic'}, default=None) – Weighting type to calculate the score. None means unweighted; “linear” means linearly weighted; “quadratic” means quadratically weighted.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns
kappa – The kappa statistic, which is a number between -1 and 1. The maximum value means complete agreement; zero or lower means chance agreement.
- Return type
float
References
- 1
- 2
- 3
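The definition \(\kappa = (p_o - p_e) / (1 - p_e)\) can be worked through directly. The sketch below covers the unweighted case only (weights=None, no sample weights). The helper cohen_kappa is hypothetical and not the library implementation.

```python
import numpy as np


def cohen_kappa(y1, y2):
    """Hypothetical helper: unweighted Cohen's kappa from the definition."""
    y1 = np.asarray(y1)
    y2 = np.asarray(y2)
    labels = np.unique(np.concatenate([y1, y2]))
    # Observed agreement: fraction of samples given the same label by both.
    p_o = np.mean(y1 == y2)
    # Expected agreement: product of each annotator's empirical label frequencies.
    p_e = sum(np.mean(y1 == c) * np.mean(y2 == c) for c in labels)
    # Note: p_e == 1 (both annotators always use one identical label) divides by zero.
    return float((p_o - p_e) / (1 - p_e))


y1 = [0, 0, 1, 1]
y2 = [0, 1, 1, 1]
kappa = cohen_kappa(y1, y2)
kappa_swapped = cohen_kappa(y2, y1)
```

Here p_o = 3/4 and p_e = (1/2)(1/4) + (1/2)(3/4) = 1/2, so kappa = 0.5. Swapping the annotators gives the same value, as the symmetry remark on y2 states.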
- bedroc_score(y_true: numpy.ndarray, y_pred: numpy.ndarray, alpha: float = 20.0)[source]¶
Compute BEDROC metric.
BEDROC metric implemented according to Truchon and Bayly, which modifies the ROC score by adding a factor for early recognition. See [1]_ for details.
- Parameters
y_true (np.ndarray) – Binary class labels. 1 for positive class, 0 otherwise
y_pred (np.ndarray) – Predicted labels
alpha (float, default 20.0) – Early recognition parameter
- Returns
Value in [0, 1] that indicates the degree of early recognition
- Return type
float
Notes
This function requires RDKit to be installed.
References
- 1
Truchon et al. “Evaluating virtual screening methods: good and bad metrics for the “early recognition” problem.” Journal of chemical information and modeling 47.2 (2007): 488-508.
- concordance_index(y_true: numpy.ndarray, y_pred: numpy.ndarray) float [source]¶
Compute Concordance index.
A statistical metric that indicates the quality of the predicted ranking. See [1]_ for details.
- Parameters
y_true (np.ndarray) – Continuous ground truth values
y_pred (np.ndarray) – Predicted values
- Returns
Score in the range [0, 1]
- Return type
float
References
- 1
Steck, Harald, et al. “On ranking in survival analysis: Bounds on the concordance index.” Advances in neural information processing systems (2008): 1209-1216.
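To illustrate what "quality of the predicted ranking" means, here is a brute-force sketch of a concordance index: the fraction of comparable pairs whose predicted order agrees with the true order. The helper concordance_index_sketch is hypothetical. The tie handling (pairs with equal true values are skipped, tied predictions earn half credit) is an assumption of this sketch, and DeepChem's implementation may treat ties differently.

```python
def concordance_index_sketch(y_true, y_pred):
    """Hypothetical helper: O(n^2) concordance index over all comparable pairs."""
    concordant = 0.0
    comparable = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue  # assumed: pairs with tied ground truth are not comparable
            comparable += 1
            true_diff = y_true[i] - y_true[j]
            pred_diff = y_pred[i] - y_pred[j]
            if pred_diff == 0:
                concordant += 0.5  # assumed: tied predictions get half credit
            elif (true_diff > 0) == (pred_diff > 0):
                concordant += 1.0  # predicted order matches true order
    if comparable == 0:
        raise ValueError("No comparable pairs: all true values are equal.")
    return concordant / comparable


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [0.1, 0.3, 0.2, 0.9]
ci = concordance_index_sketch(y_true, y_pred)
```

Of the six pairs, only the pair with true values 2.0 and 3.0 is ordered incorrectly, so the score is 5/6. A perfectly ordered prediction scores 1 and a fully reversed one scores 0.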
- get_motif_scores(encoded_sequences: numpy.ndarray, motif_names: List[str], max_scores: Optional[int] = None, return_positions: bool = False, GC_fraction: float = 0.4) numpy.ndarray [source]¶
Computes PWM (position weight matrix) log odds.
- Parameters
encoded_sequences (np.ndarray) – A numpy array of shape (N_sequences, N_letters, sequence_length, 1).
motif_names (List[str]) – List of motif file names.
max_scores (int, optional) – Get top max_scores scores.
return_positions (bool, default False) – Whether to return positions or not.
GC_fraction (float, default 0.4) – GC fraction in background sequence.
- Returns
A numpy array of complete scores. The shape is (N_sequences, num_motifs, seq_length) by default. If max_scores is set, the shape of the score array is (N_sequences, num_motifs*max_scores). If both max_scores and return_positions are set, the array holds the max scores and their positions and has shape (N_sequences, 2*num_motifs*max_scores).
- Return type
np.ndarray
Notes
This method requires simdna to be installed.
- get_pssm_scores(encoded_sequences: numpy.ndarray, pssm: numpy.ndarray) numpy.ndarray [source]¶
Convolves pssm and its reverse complement with encoded sequences and returns the maximum score at each position of each sequence.
- Parameters
encoded_sequences (np.ndarray) – A numpy array of shape (N_sequences, N_letters, sequence_length, 1).
pssm (np.ndarray) – A numpy array of shape (4, pssm_length).
- Returns
scores – A numpy array of shape (N_sequences, sequence_length).
- Return type
np.ndarray
- in_silico_mutagenesis(model: deepchem.models.models.Model, encoded_sequences: numpy.ndarray) numpy.ndarray [source]¶
Computes in-silico-mutagenesis scores
- Parameters
model (Model) – This can be any model that accepts inputs of the required shape and produces an output of shape (N_sequences, N_tasks).
encoded_sequences (np.ndarray) – A numpy array of shape (N_sequences, N_letters, sequence_length, 1)
- Returns
A numpy array of ISM scores. The shape is (num_task, N_sequences, N_letters, sequence_length, 1).
- Return type
np.ndarray
Metric Class¶
The dc.metrics.Metric class is a wrapper around metric functions which interoperates with DeepChem dc.models.Model.
- class Metric(metric: Callable[[...], float], task_averager: Optional[Callable[[...], Any]] = None, name: Optional[str] = None, threshold: Optional[float] = None, mode: Optional[str] = None, n_tasks: Optional[int] = None, classification_handling_mode: Optional[str] = None, threshold_value: Optional[float] = None)[source]¶
Wrapper class for computing user-defined metrics.
The Metric class provides a wrapper for standardizing the API around different classes of metrics that may be useful for DeepChem models. The implementation provides a few non-standard conveniences such as built-in support for multitask and multiclass metrics.
There are a variety of different metrics this class aims to support. Metrics for classification and regression that assume that values to compare are scalars are supported.
At present, this class doesn’t support metric computation on models which don’t present scalar outputs. For example, if you have a generative model which predicts images or molecules, you will need to write a custom evaluation and metric setup.
- __init__(metric: Callable[[...], float], task_averager: Optional[Callable[[...], Any]] = None, name: Optional[str] = None, threshold: Optional[float] = None, mode: Optional[str] = None, n_tasks: Optional[int] = None, classification_handling_mode: Optional[str] = None, threshold_value: Optional[float] = None)[source]¶
- Parameters
metric (function) – Function that takes args y_true, y_pred (in that order) and computes desired score. If sample weights are to be considered, metric may take in an additional keyword argument sample_weight.
task_averager (function, default None) – If not None, should be a function that averages metrics across tasks.
name (str, default None) – Name of this metric
threshold (float, default None (DEPRECATED)) – Used for binary metrics and is the threshold for the positive class.
mode (str, default None) – Should usually be “classification” or “regression.”
n_tasks (int, default None) – The number of tasks this class is expected to handle.
classification_handling_mode (str, default None) –
DeepChem models by default predict class probabilities for classification problems. This means that for a given singletask prediction, after shape normalization, the DeepChem labels and prediction will be numpy arrays of shape (n_samples, n_tasks, n_classes) with class probabilities. classification_handling_mode is a string that instructs this method how to handle transforming these probabilities. It can take on the following values:
- “direct”: Pass y_true and y_pred directly into self.metric.
- “threshold”: Use threshold_predictions to threshold y_true and y_pred. Use threshold_value as the desired threshold. This converts them into arrays of shape (n_samples, n_tasks), where each element is a class index.
- “threshold-one-hot”: Use threshold_predictions to threshold y_true and y_pred using threshold_value, then apply to_one_hot to output.
- None: Select a mode automatically based on the metric.
threshold_value (float, default None) – If set, and classification_handling_mode is “threshold” or “threshold-one-hot”, apply a thresholding operation to values with this threshold. This option is only sensible on binary classification tasks. For multiclass problems, or if threshold_value is None, argmax() is used to select the highest probability class for each task.
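The effect of the handling modes on array shapes can be sketched with plain NumPy. This is an illustration of the transformations described above, not DeepChem's code. The example probabilities are made up, and the use of >= when comparing against threshold_value is an assumption.

```python
import numpy as np

# Class probabilities of shape (n_samples, n_tasks, n_classes) = (3, 2, 2).
y_pred = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.4, 0.6], [0.7, 0.3]],
    [[0.3, 0.7], [0.45, 0.55]],
])

# "threshold" with threshold_value=None: pick the most probable class per task.
argmax_classes = np.argmax(y_pred, axis=-1)  # shape (3, 2)

# "threshold" with a threshold_value (binary only): class 1 if P(class 1)
# reaches the threshold. The >= comparison is an assumption of this sketch.
threshold_value = 0.65
thresholded_classes = (y_pred[..., 1] >= threshold_value).astype(int)  # shape (3, 2)

# "threshold-one-hot": expand the class indices back into one-hot vectors.
one_hot = np.eye(2, dtype=int)[thresholded_classes]  # shape (3, 2, 2)
```

With argmax, a class-1 probability of 0.6 or 0.55 is enough to predict class 1, while the stricter threshold of 0.65 turns those two predictions into class 0.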
- compute_metric(y_true: ArrayLike, y_pred: ArrayLike, w: Optional[ArrayLike] = None, n_tasks: Optional[int] = None, n_classes: int = 2, per_task_metrics: bool = False, use_sample_weights: bool = False, **kwargs) Any [source]¶
Compute a performance metric for each task.
- Parameters
y_true (ArrayLike) – An ArrayLike containing true values for each task. For a classification metric, must be of shape (N,) or (N, n_tasks) or (N, n_tasks, n_classes). If of shape (N, n_tasks), values can either be class labels or probabilities of the positive class for binary classification problems. For a regression metric, must be of shape (N,) or (N, n_tasks) or (N, n_tasks, 1).
y_pred (ArrayLike) – An ArrayLike containing predicted values for each task. Must be of shape (N, n_tasks, n_classes) if a classification metric, else must be of shape (N, n_tasks) if a regression metric.
w (ArrayLike, default None) – An ArrayLike containing weights for each datapoint. If specified, must be of shape (N, n_tasks).
n_tasks (int, default None) – The number of tasks this class is expected to handle.
n_classes (int, default 2) – Number of classes in data for classification tasks.
per_task_metrics (bool, default False) – If true, return computed metric for each task on multitask dataset.
use_sample_weights (bool, default False) – If set, use per-sample weights w.
kwargs (dict) – Will be passed on to self.metric
- Returns
A numpy array containing metric values for each task.
- Return type
np.ndarray
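Conceptually, a multitask metric is computed by applying a single-task metric to each task column and then combining the results with a task averager. The sketch below shows that idea for a regression metric with NumPy only. The names per_task_scores and mean_absolute_error are hypothetical stand-ins, and the sketch leaves out weights, shape normalization and classification handling, so it should not be read as the actual compute_metric logic.

```python
import numpy as np


def mean_absolute_error(y_true, y_pred):
    """Stand-in single-task metric taking (y_true, y_pred) in that order."""
    return float(np.mean(np.abs(y_true - y_pred)))


def per_task_scores(metric, y_true, y_pred, task_averager=np.mean):
    """Hypothetical helper: apply `metric` to each task column, then average."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n_tasks = y_true.shape[1]
    scores = [metric(y_true[:, t], y_pred[:, t]) for t in range(n_tasks)]
    return float(task_averager(scores)), scores


# Two regression tasks, three samples: arrays of shape (N, n_tasks) = (3, 2).
y_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
y_pred = np.array([[1.5, 10.0], [2.0, 22.0], [2.5, 30.0]])

average, per_task = per_task_scores(mean_absolute_error, y_true, y_pred)
```

The per-task errors are 1/3 and 2/3, and their mean is 0.5. Returning both values mirrors the choice between a single averaged score and per-task scores that per_task_metrics describes.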
- compute_singletask_metric(y_true: ArrayLike, y_pred: ArrayLike, w: Optional[ArrayLike] = None, n_samples: Optional[int] = None, use_sample_weights: bool = False, **kwargs) float [source]¶
Compute a metric value.
- Parameters
y_true (ArrayLike) – True values array. This array must be of shape (N, n_classes) if classification and (N,) if regression.
y_pred (ArrayLike) – Predictions array. This array must be of shape (N, n_classes) if classification and (N,) if regression.
w (ArrayLike, default None) – Sample weight array. This array must be of shape (N,)
n_samples (int, default None (DEPRECATED)) – The number of samples in the dataset. This is N. This argument is ignored.
use_sample_weights (bool, default False) – If set, use per-sample weights w.
kwargs (dict) – Will be passed on to self.metric
- Returns
metric_value – The computed value of the metric.
- Return type
float