Metrics are one of the most important parts of machine learning. Unlike
traditional software, in which algorithms either work or don't work,
machine learning models work in degrees: there is a continuous range of
“goodness” for a model. Metrics are functions that measure how well a
model works, and the appropriate choice of metric depends on the type of
model at hand.
Turns y into a one-hot encoded array of shape (N, n_classes).
Assumes that y takes values from 0 to n_classes - 1.
Parameters:
y (np.ndarray) – A vector of shape (N,) or (N, 1)
n_classes (int, default 2) – If specified, use this as the number of classes. Otherwise the
number of classes is imputed as n_classes = max(y) + 1 for arrays and as
n_classes = 2 for the case of scalars.
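For instance, a minimal sketch of the expected behavior (assuming to_one_hot is importable from deepchem.metrics; the exact output dtype may vary):
>>> import numpy as np
>>> from deepchem.metrics import to_one_hot
>>> to_one_hot(np.array([0, 1, 1]), n_classes=2)
array([[1., 0.],
       [0., 1.],
       [0., 1.]])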
One of the trickiest parts of handling metrics correctly is making sure that the
shapes of input weights, predictions, and labels are processed correctly. This
is particularly challenging since DeepChem supports multitask, multiclass
models, which means that shapes must be handled with care to prevent errors.
DeepChem maintains the following utility functions to facilitate shape
handling for you.
A utility function to correct the shape of the weight array.
This utility function is used to normalize the shapes of a given
weight array.
Parameters:
w (np.ndarray) – w can be None or a scalar or a np.ndarray of shape
(n_samples,) or of shape (n_samples, n_tasks). If w is a
scalar, it’s assumed to be the same weight for all samples/tasks.
n_samples (int) – The number of samples in the dataset. If w is not None, we should
have n_samples = w.shape[0] if w is a ndarray
n_tasks (int) – The number of tasks. If w is 2d ndarray, then we should have
w.shape[1] == n_tasks.
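A minimal usage sketch (assuming normalize_weight_shape is importable from deepchem.metrics):
>>> from deepchem.metrics import normalize_weight_shape
>>> w = normalize_weight_shape(None, n_samples=4, n_tasks=2)  # None defaults to uniform weights
>>> w.shape
(4, 2)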
A utility function to correct the shape of the labels.
Parameters:
y (np.ndarray) – y is an array of shape (N,) or (N, n_tasks) or (N, n_tasks, 1).
mode (str, default None) – If mode is “classification” or “regression”, attempts to apply
data transformations.
n_tasks (int, default None) – The number of tasks this class is expected to handle.
n_classes (int, default None) – If specified, use this as the number of classes. Otherwise the
number of classes is imputed as n_classes = max(y) + 1 for arrays and as
n_classes = 2 for the case of scalars. Note this parameter only
has an effect if mode == "classification".
Returns:
y_out – If mode==”classification”, y_out is an array of shape (N,
n_tasks, n_classes). If mode==”regression”, y_out is an array
of shape (N, n_tasks).
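A minimal usage sketch (assuming normalize_labels_shape is importable from deepchem.metrics):
>>> import numpy as np
>>> from deepchem.metrics import normalize_labels_shape
>>> y = np.array([0, 1, 1, 0])  # shape (N,)
>>> normalize_labels_shape(y, mode="classification", n_tasks=1, n_classes=2).shape
(4, 1, 2)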
A utility function to correct the shape of provided predictions.
The metric computation classes expect that inputs for classification
have the uniform shape (N, n_tasks, n_classes) and inputs for
regression have the uniform shape (N, n_tasks). This function
normalizes the provided input array to have the desired shape.
y (np.ndarray) – If mode == "classification", y is an array of shape (N,) or
(N, n_tasks) or (N, n_tasks, n_classes). If
mode == "regression", y is an array of shape (N,) or (N,
n_tasks) or (N, n_tasks, 1).
mode (str, default None) – If mode is “classification” or “regression”, attempts to apply
data transformations.
n_tasks (int, default None) – The number of tasks this class is expected to handle.
n_classes (int, default None) – If specified, use this as the number of classes. Otherwise the
number of classes is imputed as n_classes = max(y) + 1 for arrays and as
n_classes = 2 for the case of scalars. Note this parameter only
has an effect if mode == "classification".
Returns:
y_out – If mode==”classification”, y_out is an array of shape (N,
n_tasks, n_classes). If mode==”regression”, y_out is an array
of shape (N, n_tasks).
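A minimal usage sketch for the regression case (assuming normalize_prediction_shape is importable from deepchem.metrics):
>>> import numpy as np
>>> from deepchem.metrics import normalize_prediction_shape
>>> y_p = np.random.rand(4)  # shape (N,)
>>> normalize_prediction_shape(y_p, mode="regression", n_tasks=1).shape
(4, 1)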
DeepChem models by default predict class probabilities for
classification problems. This means that for a given singletask
prediction, after shape normalization, the DeepChem prediction will be a
numpy array of shape (N, n_classes) with class probabilities.
classification_handling_mode is a string that instructs this method
how to handle transforming these probabilities. It can take on the
following values:
- None: default value. Pass y_pred directly into self.metric.
- "threshold": Use threshold_predictions to threshold y_pred, with
threshold_value as the desired threshold.
- "threshold-one-hot": Use threshold_predictions to threshold y_pred
using threshold_value, then apply to_one_hot to the output.
threshold_value (float, default None) – If set, and classification_handling_mode is "threshold" or
"threshold-one-hot", apply a thresholding operation to values with this
threshold. This option is only sensible for binary classification tasks.
If float, this will be applied as a binary classification value.
Returns:
y_out – If classification_handling_mode is None, y_out is of shape (N,
n_tasks, n_classes). If classification_handling_mode is "threshold",
y_out is of shape (N, n_tasks). If classification_handling_mode is
"threshold-one-hot", y_out is of shape (N, n_tasks, n_classes).
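To illustrate what "threshold" does for a binary task, a minimal hand-rolled sketch (not the DeepChem implementation):
>>> import numpy as np
>>> probs = np.array([[[0.8, 0.2]], [[0.3, 0.7]]])  # (N=2, n_tasks=1, n_classes=2)
>>> (probs[..., 1] >= 0.5).astype(int)              # threshold_value = 0.5
array([[0],
       [1]])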
DeepChem has a variety of different metrics which are useful for measuring model performance. A number (but not all) of these metrics are directly sourced from sklearn.
Compute the Matthews correlation coefficient (MCC).
The Matthews correlation coefficient is used in machine learning as a
measure of the quality of binary and multiclass classifications. It takes
into account true and false positives and negatives and is generally
regarded as a balanced measure which can be used even if the classes are of
very different sizes. The MCC is in essence a correlation coefficient value
between -1 and +1. A coefficient of +1 represents a perfect prediction, 0
an average random prediction and -1 an inverse prediction. The statistic
is also known as the phi coefficient. [source: Wikipedia]
Binary and multiclass labels are supported. Only in the binary case does
this relate to information about true and false positives and negatives.
See references below.
Read more in the User Guide.
Parameters:
y_true (array-like of shape (n_samples,)) – Ground truth (correct) target values.
y_pred (array-like of shape (n_samples,)) – Estimated targets as returned by a classifier.
sample_weight (array-like of shape (n_samples,), default=None) –
Sample weights.
New in version 0.18.
Returns:
mcc – The Matthews correlation coefficient (+1 represents a perfect
prediction, 0 an average random prediction and -1 an inverse
prediction).
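For example:
>>> from sklearn.metrics import matthews_corrcoef
>>> y_true = [+1, +1, +1, -1]
>>> y_pred = [+1, -1, +1, +1]
>>> matthews_corrcoef(y_true, y_pred)
-0.33...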
The recall is the ratio tp/(tp+fn) where tp is the number of
true positives and fn the number of false negatives. The recall is
intuitively the ability of the classifier to find all the positive samples.
The best value is 1 and the worst value is 0.
Support beyond binary targets is achieved by treating multiclass
and multilabel data as a collection of binary problems, one for each
label. For the binary case, setting average='binary' will return
recall for pos_label. If average is not 'binary', pos_label is ignored
and recall for both classes is computed, then averaged or both returned (when
average=None). Similarly, for multiclass and multilabel targets,
recall for all labels is either returned or averaged depending on the average
parameter. Use labels to specify the set of labels to calculate recall for.
Read more in the User Guide.
Parameters:
y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) target values.
y_pred (1d array-like, or label indicator array / sparse matrix) – Estimated targets as returned by a classifier.
labels (array-like, default=None) –
The set of labels to include when average != ‘binary’, and their
order if average is None. Labels present in the data can be
excluded, for example in multiclass classification to exclude a “negative
class”. Labels not present in the data can be included and will be
“assigned” 0 samples. For multilabel targets, labels are column indices.
By default, all labels in y_true and y_pred are used in sorted order.
Changed in version 0.17: Parameter labels improved for multiclass problem.
pos_label (int, float, bool or str, default=1) – The class to report if average=’binary’ and the data is binary,
otherwise this parameter is ignored.
For multiclass or multilabel targets, set labels=[pos_label] and
average != ‘binary’ to report metrics for one label only.
average ({'micro', 'macro', 'samples', 'weighted', 'binary'} or None, default='binary') –
This parameter is required for multiclass/multilabel targets.
If None, the scores for each class are returned. Otherwise, this
determines the type of averaging performed on the data:
'binary':
Only report results for the class specified by pos_label.
This is applicable only if targets (y_{true,pred}) are binary.
'micro':
Calculate metrics globally by counting the total true positives,
false negatives and false positives.
'macro':
Calculate metrics for each label, and find their unweighted
mean. This does not take label imbalance into account.
'weighted':
Calculate metrics for each label, and find their average weighted
by support (the number of true instances for each label). This
alters ‘macro’ to account for label imbalance; it can result in an
F-score that is not between precision and recall. Weighted recall
is equal to accuracy.
'samples':
Calculate metrics for each instance, and find their average (only
meaningful for multilabel classification where this differs from
accuracy_score()).
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
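For example:
>>> from sklearn.metrics import recall_score
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> recall_score(y_true, y_pred, average='macro')
0.33...
>>> recall_score(y_true, y_pred, average=None)
array([1., 0., 0.])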
\(R^2\) (coefficient of determination) regression score function.
Best possible score is 1.0 and it can be negative (because the
model can be arbitrarily worse). In the general case when the true y is
non-constant, a constant model that always predicts the average y
disregarding the input features would get a \(R^2\) score of 0.0.
In the particular case when y_true is constant, the \(R^2\) score
is not finite: it is either NaN (perfect predictions) or -Inf
(imperfect predictions). To prevent such non-finite numbers from polluting
higher-level experiments such as a grid search cross-validation, by default
these cases are replaced with 1.0 (perfect predictions) or 0.0 (imperfect
predictions) respectively. You can set force_finite to False to
prevent this fix from happening.
Note: when the prediction residuals have zero mean, the \(R^2\) score
is identical to the explained variance score.
Read more in the User Guide.
Parameters:
y_true (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Ground truth (correct) target values.
y_pred (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Estimated target values.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
multioutput ({'raw_values', 'uniform_average', 'variance_weighted'}, array-like of shape (n_outputs,) or None, default='uniform_average') –
Defines aggregating of multiple output scores.
Array-like value defines weights used to average scores.
Default is “uniform_average”.
’raw_values’ :
Returns a full set of scores in case of multioutput input.
’uniform_average’ :
Scores of all outputs are averaged with uniform weight.
’variance_weighted’ :
Scores of all outputs are averaged, weighted by the variances
of each individual output.
Changed in version 0.19: Default value of multioutput is ‘uniform_average’.
force_finite (bool, default=True) –
Flag indicating if NaN and -Inf scores resulting from constant
data should be replaced with real numbers (1.0 if prediction is
perfect, 0.0 otherwise). Default is True, a convenient setting
for hyperparameters’ search procedures (e.g. grid search
cross-validation).
New in version 1.1.
Returns:
z – The \(R^2\) score or ndarray of scores if ‘multioutput’ is
‘raw_values’.
Return type:
float or ndarray of floats
Notes
This is not a symmetric function.
Unlike most other scores, \(R^2\) score may be negative (it need not
actually be the square of a quantity R).
This metric is not well-defined for single samples and will return a NaN
value if n_samples is less than two.
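For example:
>>> from sklearn.metrics import r2_score
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> r2_score(y_true, y_pred)
0.948...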
y_true (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Ground truth (correct) target values.
y_pred (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Estimated target values.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
multioutput ({'raw_values', 'uniform_average'} or array-like of shape (n_outputs,), default='uniform_average') –
Defines aggregating of multiple output values.
Array-like value defines weights used to average errors.
’raw_values’ :
Returns a full set of errors in case of multioutput input.
’uniform_average’ :
Errors of all outputs are averaged with uniform weight.
squared (bool, default=True) –
If True returns MSE value, if False returns RMSE value.
Deprecated since version 1.4: squared is deprecated in 1.4 and will be removed in 1.6.
Use root_mean_squared_error()
instead to calculate the root mean squared error.
Returns:
loss – A non-negative floating point value (the best value is 0.0), or an
array of floating point values, one for each individual target.
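For example:
>>> from sklearn.metrics import mean_squared_error
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> mean_squared_error(y_true, y_pred)
0.375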
y_true (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Ground truth (correct) target values.
y_pred (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Estimated target values.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
multioutput ({'raw_values', 'uniform_average'} or array-like of shape (n_outputs,), default='uniform_average') –
Defines aggregating of multiple output values.
Array-like value defines weights used to average errors.
’raw_values’ :
Returns a full set of errors in case of multioutput input.
’uniform_average’ :
Errors of all outputs are averaged with uniform weight.
Returns:
loss – If multioutput is ‘raw_values’, then mean absolute error is returned
for each output separately.
If multioutput is ‘uniform_average’ or an ndarray of weights, then the
weighted average of all output errors is returned.
MAE output is non-negative floating point. The best value is 0.0.
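For example:
>>> from sklearn.metrics import mean_absolute_error
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> mean_absolute_error(y_true, y_pred)
0.5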
The precision is the ratio tp/(tp+fp) where tp is the number of
true positives and fp the number of false positives. The precision is
intuitively the ability of the classifier not to label as positive a sample
that is negative.
The best value is 1 and the worst value is 0.
Support beyond binary targets is achieved by treating multiclass
and multilabel data as a collection of binary problems, one for each
label. For the binary case, setting average='binary' will return
precision for pos_label. If average is not 'binary', pos_label is ignored
and precision for both classes is computed, then averaged or both returned (when
average=None). Similarly, for multiclass and multilabel targets,
precision for all labels is either returned or averaged depending on the
average parameter. Use labels to specify the set of labels to calculate
precision for.
Read more in the User Guide.
Parameters:
y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) target values.
y_pred (1d array-like, or label indicator array / sparse matrix) – Estimated targets as returned by a classifier.
labels (array-like, default=None) –
The set of labels to include when average != ‘binary’, and their
order if average is None. Labels present in the data can be
excluded, for example in multiclass classification to exclude a “negative
class”. Labels not present in the data can be included and will be
“assigned” 0 samples. For multilabel targets, labels are column indices.
By default, all labels in y_true and y_pred are used in sorted order.
Changed in version 0.17: Parameter labels improved for multiclass problem.
pos_label (int, float, bool or str, default=1) – The class to report if average=’binary’ and the data is binary,
otherwise this parameter is ignored.
For multiclass or multilabel targets, set labels=[pos_label] and
average != ‘binary’ to report metrics for one label only.
average ({'micro', 'macro', 'samples', 'weighted', 'binary'} or None, default='binary') –
This parameter is required for multiclass/multilabel targets.
If None, the scores for each class are returned. Otherwise, this
determines the type of averaging performed on the data:
'binary':
Only report results for the class specified by pos_label.
This is applicable only if targets (y_{true,pred}) are binary.
'micro':
Calculate metrics globally by counting the total true positives,
false negatives and false positives.
'macro':
Calculate metrics for each label, and find their unweighted
mean. This does not take label imbalance into account.
'weighted':
Calculate metrics for each label, and find their average weighted
by support (the number of true instances for each label). This
alters ‘macro’ to account for label imbalance; it can result in an
F-score that is not between precision and recall.
'samples':
Calculate metrics for each instance, and find their average (only
meaningful for multilabel classification where this differs from
accuracy_score()).
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
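For example:
>>> from sklearn.metrics import precision_score
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> precision_score(y_true, y_pred, average='macro')
0.22...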
Compute precision-recall pairs for different probability thresholds.
Note: this implementation is restricted to the binary classification task.
The precision is the ratio tp/(tp+fp) where tp is the number of
true positives and fp the number of false positives. The precision is
intuitively the ability of the classifier not to label as positive a sample
that is negative.
The recall is the ratio tp/(tp+fn) where tp is the number of
true positives and fn the number of false negatives. The recall is
intuitively the ability of the classifier to find all the positive samples.
The last precision and recall values are 1. and 0. respectively and do not
have a corresponding threshold. This ensures that the graph starts on the
y axis.
The first precision and recall values are precision=class balance and recall=1.0
which corresponds to a classifier that always predicts the positive class.
Read more in the User Guide.
Parameters:
y_true (array-like of shape (n_samples,)) – True binary labels. If labels are not either {-1, 1} or {0, 1}, then
pos_label should be explicitly given.
probas_pred (array-like of shape (n_samples,)) – Target scores, can either be probability estimates of the positive
class, or non-thresholded measure of decisions (as returned by
decision_function on some classifiers).
pos_label (int, float, bool or str, default=None) – The label of the positive class.
When pos_label=None, if y_true is in {-1, 1} or {0, 1},
pos_label is set to 1, otherwise an error will be raised.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
drop_intermediate (bool, default=False) –
Whether to drop some suboptimal thresholds which would not appear
on a plotted precision-recall curve. This is useful in order to create
lighter precision-recall curves.
New in version 1.3.
Returns:
precision (ndarray of shape (n_thresholds + 1,)) – Precision values such that element i is the precision of
predictions with score >= thresholds[i] and the last element is 1.
recall (ndarray of shape (n_thresholds + 1,)) – Decreasing recall values such that element i is the recall of
predictions with score >= thresholds[i] and the last element is 0.
thresholds (ndarray of shape (n_thresholds,)) – Increasing thresholds on the decision function used to compute
precision and recall where n_thresholds = len(np.unique(probas_pred)).
See also
PrecisionRecallDisplay.from_estimator
Plot Precision Recall Curve given a binary classifier.
PrecisionRecallDisplay.from_predictions
Plot Precision Recall Curve using predictions from a binary classifier.
average_precision_score
Compute average precision from prediction scores.
det_curve
Compute error rates for different probability thresholds.
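For example:
>>> import numpy as np
>>> from sklearn.metrics import precision_recall_curve
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
>>> precision
array([0.5       , 0.66666667, 0.5       , 1.        , 1.        ])
>>> recall
array([1. , 1. , 0.5, 0.5, 0. ])
>>> thresholds
array([0.1 , 0.35, 0.4 , 0.8 ])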
Compute Area Under the Curve (AUC) using the trapezoidal rule.
This is a general function, given points on a curve. For computing the
area under the ROC-curve, see roc_auc_score(). For an alternative
way to summarize a precision-recall curve, see
average_precision_score().
Parameters:
x (array-like of shape (n,)) – X coordinates. These must be either monotonic increasing or monotonic
decreasing.
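For example, computing ROC AUC via auc():
>>> import numpy as np
>>> from sklearn import metrics
>>> y = np.array([1, 1, 2, 2])
>>> pred = np.array([0.1, 0.4, 0.35, 0.8])
>>> fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)
>>> metrics.auc(fpr, tpr)
0.75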
The Jaccard index [1], or Jaccard similarity coefficient, defined as
the size of the intersection divided by the size of the union of two label
sets, is used to compare the set of predicted labels for a sample to the
corresponding set of labels in y_true.
Support beyond binary targets is achieved by treating multiclass
and multilabel data as a collection of binary problems, one for each
label. For the binary case, setting average='binary' will return the
Jaccard similarity coefficient for pos_label. If average is not 'binary',
pos_label is ignored and scores for both classes are computed, then averaged or
both returned (when average=None). Similarly, for multiclass and
multilabel targets, scores for all labels are either returned or
averaged depending on the average parameter. Use labels to specify the set of
labels to calculate the score for.
Read more in the User Guide.
Parameters:
y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) labels.
y_pred (1d array-like, or label indicator array / sparse matrix) – Predicted labels, as returned by a classifier.
labels (array-like of shape (n_classes,), default=None) – The set of labels to include when average != ‘binary’, and their
order if average is None. Labels present in the data can be
excluded, for example in multiclass classification to exclude a “negative
class”. Labels not present in the data can be included and will be
“assigned” 0 samples. For multilabel targets, labels are column indices.
By default, all labels in y_true and y_pred are used in sorted order.
pos_label (int, float, bool or str, default=1) – The class to report if average=’binary’ and the data is binary,
otherwise this parameter is ignored.
For multiclass or multilabel targets, set labels=[pos_label] and
average != ‘binary’ to report metrics for one label only.
average ({'micro', 'macro', 'samples', 'weighted', 'binary'} or None, default='binary') –
If None, the scores for each class are returned. Otherwise, this
determines the type of averaging performed on the data:
'binary':
Only report results for the class specified by pos_label.
This is applicable only if targets (y_{true,pred}) are binary.
'micro':
Calculate metrics globally by counting the total true positives,
false negatives and false positives.
'macro':
Calculate metrics for each label, and find their unweighted
mean. This does not take label imbalance into account.
'weighted':
Calculate metrics for each label, and find their average, weighted
by support (the number of true instances for each label). This
alters ‘macro’ to account for label imbalance.
'samples':
Calculate metrics for each instance, and find their average (only
meaningful for multilabel classification).
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
zero_division ("warn", {0.0, 1.0}, default="warn") – Sets the value to return when there is a zero division, i.e. when there
are no negative values in predictions and labels. If set to
"warn", this acts like 0, but a warning is also raised.
Returns:
score – The Jaccard score. When average is not None, a single scalar is
returned.
Return type:
float or ndarray of shape (n_unique_labels,), dtype=np.float64
See also
multilabel_confusion_matrix
Compute a confusion matrix for each class or sample.
Notes
jaccard_score() may be a poor metric if there are no
positives for some samples or classes. Jaccard is undefined if there are
no true or predicted labels, and our implementation will return a score
of 0 with a warning.
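For example, in a binary case:
>>> import numpy as np
>>> from sklearn.metrics import jaccard_score
>>> y_true = np.array([0, 1, 1])
>>> y_pred = np.array([1, 1, 1])
>>> jaccard_score(y_true, y_pred)
0.66...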
Compute the F1 score, also known as balanced F-score or F-measure.
The F1 score can be interpreted as a harmonic mean of the precision and
recall, where an F1 score reaches its best value at 1 and worst score at 0.
The relative contributions of precision and recall to the F1 score are
equal. The formula for the F1 score is:
\[F1 = \frac{2 \times \text{TP}}{2 \times \text{TP} + \text{FP} + \text{FN}}\]
where \(\text{TP}\) is the number of true positives, \(\text{FN}\) is the
number of false negatives, and \(\text{FP}\) is the number of false positives.
F1 is by default
calculated as 0.0 when there are no true positives, false negatives, or
false positives.
Support beyond binary targets is achieved by treating multiclass
and multilabel data as a collection of binary problems, one for each
label. For the binary case, setting average=’binary’ will return
F1 score for pos_label. If average is not ‘binary’, pos_label is ignored
and the F1 score for both classes is computed, then averaged or both returned (when
average=None). Similarly, for multiclass and multilabel targets,
the F1 score for all labels is either returned or averaged depending on the
average parameter. Use labels to specify the set of labels to calculate the F1
score for.
Read more in the User Guide.
Parameters:
y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) target values.
y_pred (1d array-like, or label indicator array / sparse matrix) – Estimated targets as returned by a classifier.
labels (array-like, default=None) –
The set of labels to include when average != ‘binary’, and their
order if average is None. Labels present in the data can be
excluded, for example in multiclass classification to exclude a “negative
class”. Labels not present in the data can be included and will be
“assigned” 0 samples. For multilabel targets, labels are column indices.
By default, all labels in y_true and y_pred are used in sorted order.
Changed in version 0.17: Parameter labels improved for multiclass problem.
pos_label (int, float, bool or str, default=1) – The class to report if average=’binary’ and the data is binary,
otherwise this parameter is ignored.
For multiclass or multilabel targets, set labels=[pos_label] and
average != ‘binary’ to report metrics for one label only.
average ({'micro', 'macro', 'samples', 'weighted', 'binary'} or None, default='binary') –
This parameter is required for multiclass/multilabel targets.
If None, the scores for each class are returned. Otherwise, this
determines the type of averaging performed on the data:
'binary':
Only report results for the class specified by pos_label.
This is applicable only if targets (y_{true,pred}) are binary.
'micro':
Calculate metrics globally by counting the total true positives,
false negatives and false positives.
'macro':
Calculate metrics for each label, and find their unweighted
mean. This does not take label imbalance into account.
'weighted':
Calculate metrics for each label, and find their average weighted
by support (the number of true instances for each label). This
alters ‘macro’ to account for label imbalance; it can result in an
F-score that is not between precision and recall.
'samples':
Calculate metrics for each instance, and find their average (only
meaningful for multilabel classification where this differs from
accuracy_score()).
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
See also
multilabel_confusion_matrix
Compute a confusion matrix for each class or sample.
Notes
When true positive + false positive + false negative == 0 (i.e. a class
is completely absent from both y_true and y_pred), the F-score is
undefined. In such cases, by default the F-score will be set to 0.0, and
UndefinedMetricWarning will be raised. This behavior can be modified by
setting the zero_division parameter.
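For example:
>>> from sklearn.metrics import f1_score
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> f1_score(y_true, y_pred, average='macro')
0.26...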
Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.
Note: this implementation can be used with binary, multiclass and
multilabel classification, but some restrictions apply (see Parameters).
Read more in the User Guide.
Parameters:
y_true (array-like of shape (n_samples,) or (n_samples, n_classes)) – True labels or binary label indicators. The binary and multiclass cases
expect labels with shape (n_samples,) while the multilabel case expects
binary label indicators with shape (n_samples, n_classes).
y_score (array-like of shape (n_samples,) or (n_samples, n_classes)) –
Target scores.
In the binary case, it corresponds to an array of shape
(n_samples,). Both probability estimates and non-thresholded
decision values can be provided. The probability estimates correspond
to the probability of the class with the greater label,
i.e. estimator.classes_[1], and thus
estimator.predict_proba(X)[:, 1]. The decision values
correspond to the output of estimator.decision_function(X).
See more information in the User guide;
In the multiclass case, it corresponds to an array of shape
(n_samples, n_classes) of probability estimates provided by the
predict_proba method. The probability estimates must
sum to 1 across the possible classes. In addition, the order of the
class scores must correspond to the order of labels,
if provided, or else to the numerical or lexicographical order of
the labels in y_true. See more information in the
User guide;
In the multilabel case, it corresponds to an array of shape
(n_samples, n_classes). Probability estimates are provided by the
predict_proba method and the non-thresholded decision values by
the decision_function method. The probability estimates correspond
to the probability of the class with the greater label for each
output of the classifier. See more information in the
User guide.
average ({'micro', 'macro', 'samples', 'weighted'} or None, default='macro') –
If None, the scores for each class are returned.
Otherwise, this determines the type of averaging performed on the data.
Note: multiclass ROC AUC currently only handles the ‘macro’ and
‘weighted’ averages. For multiclass targets, average=None is only
implemented for multi_class=’ovr’ and average=’micro’ is only
implemented for multi_class=’ovr’.
'micro':
Calculate metrics globally by considering each element of the label
indicator matrix as a label.
'macro':
Calculate metrics for each label, and find their unweighted
mean. This does not take label imbalance into account.
'weighted':
Calculate metrics for each label, and find their average, weighted
by support (the number of true instances for each label).
'samples':
Calculate metrics for each instance, and find their average.
Will be ignored when y_true is binary.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
max_fpr (float > 0 and <= 1, default=None) – If not None, the standardized partial AUC [2] over the range
[0, max_fpr] is returned. For the multiclass case, max_fpr
should be either equal to None or 1.0, as partial ROC AUC
computation is not currently supported for multiclass.
multi_class ({'raise', 'ovr', 'ovo'}, default='raise') –
Only used for multiclass targets. Determines the type of configuration
to use. The default value raises an error, so either
'ovr' or 'ovo' must be passed explicitly.
'ovr':
Stands for One-vs-rest. Computes the AUC of each class
against the rest [3], [4]. This
treats the multiclass case in the same way as the multilabel case.
Sensitive to class imbalance even when average=='macro',
because class imbalance affects the composition of each of the
‘rest’ groupings.
'ovo':
Stands for One-vs-one. Computes the average AUC of all
possible pairwise combinations of classes [5].
Insensitive to class imbalance when
average=='macro'.
labels (array-like of shape (n_classes,), default=None) – Only used for multiclass targets. List of labels that index the
classes in y_score. If None, the numerical or lexicographical
order of the labels in y_true is used.
RocCurveDisplay.from_estimator
Plot Receiver Operating Characteristic (ROC) curve given an estimator and some data.
RocCurveDisplay.from_predictions
Plot Receiver Operating Characteristic (ROC) curve given the true and predicted values.
Notes
The Gini Coefficient is a summary measure of the ranking ability of binary
classifiers. It is expressed in terms of the area under the ROC curve as follows:
G = 2 * AUC - 1
Where G is the Gini coefficient and AUC is the ROC-AUC score. This normalisation
will ensure that random guessing will yield a score of 0 in expectation, and it is
upper bounded by 1.
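Examples
For the common binary case:
>>> from sklearn.metrics import roc_auc_score
>>> y_true = [0, 0, 1, 1]
>>> y_scores = [0.1, 0.4, 0.35, 0.8]
>>> roc_auc_score(y_true, y_scores)
0.75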
In the multilabel case:
>>> import numpy as np
>>> from sklearn.datasets import make_multilabel_classification
>>> from sklearn.multioutput import MultiOutputClassifier
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.metrics import roc_auc_score
>>> X, y = make_multilabel_classification(random_state=0)
>>> inner_clf = LogisticRegression(solver="liblinear", random_state=0)
>>> clf = MultiOutputClassifier(inner_clf).fit(X, y)
>>> # get a list of n_output containing probability arrays of shape
>>> # (n_samples, n_classes)
>>> y_pred = clf.predict_proba(X)
>>> # extract the positive columns for each output
>>> y_pred = np.transpose([pred[:, 1] for pred in y_pred])
>>> roc_auc_score(y, y_pred, average=None)
array([0.82..., 0.86..., 0.94..., 0.85..., 0.94...])
>>> from sklearn.linear_model import RidgeClassifierCV
>>> clf = RidgeClassifierCV().fit(X, y)
>>> roc_auc_score(y, clf.decision_function(X), average=None)
array([0.81..., 0.84..., 0.93..., 0.87..., 0.94...])
In multilabel classification, this function computes subset accuracy:
the set of labels predicted for a sample must exactly match the
corresponding set of labels in y_true.
Read more in the User Guide.
Parameters:
y_true (1d array-like, or label indicator array / sparse matrix) – Ground truth (correct) labels.
y_pred (1d array-like, or label indicator array / sparse matrix) – Predicted labels, as returned by a classifier.
normalize (bool, default=True) – If False, return the number of correctly classified samples.
Otherwise, return the fraction of correctly classified samples.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
Returns:
score – If normalize==True, return the fraction of correctly
classified samples (float), else returns the number of correctly
classified samples (int).
The best performance is 1 with normalize==True and the number
of samples with normalize==False.
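For example:
>>> from sklearn.metrics import accuracy_score
>>> y_true = [0, 1, 2, 3]
>>> y_pred = [0, 2, 1, 3]
>>> accuracy_score(y_true, y_pred)
0.5
>>> accuracy_score(y_true, y_pred, normalize=False)
2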
The balanced accuracy in binary and multiclass classification problems is
used to deal with imbalanced datasets. It is defined as the average of
recall obtained on each class.
The best value is 1 and the worst value is 0 when adjusted=False.
Read more in the User Guide.
New in version 0.20.
Parameters:
y_true (array-like of shape (n_samples,)) – Ground truth (correct) target values.
y_pred (array-like of shape (n_samples,)) – Estimated targets as returned by a classifier.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
adjusted (bool, default=False) – When true, the result is adjusted for chance, so that random
performance would score 0, while keeping perfect performance at a score
of 1.
Returns:
balanced_accuracy – Balanced accuracy score.
Return type:
float
See also
average_precision_score
Compute average precision (AP) from prediction scores.
roc_auc_score
Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.
Notes
Some literature promotes alternative definitions of balanced accuracy. Our
definition is equivalent to accuracy_score() with class-balanced
sample weights, and shares desirable properties with the binary case.
See the User Guide.
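For example:
>>> from sklearn.metrics import balanced_accuracy_score
>>> y_true = [0, 1, 0, 0, 1, 0]
>>> y_pred = [0, 1, 0, 0, 0, 1]
>>> balanced_accuracy_score(y_true, y_pred)
0.625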
This metric computes the number of times where the correct label is among
the top k labels predicted (ranked by predicted scores). Note that the
multilabel case isn’t covered here.
Read more in the User Guide
Parameters:
y_true (array-like of shape (n_samples,)) – True labels.
y_score (array-like of shape (n_samples,) or (n_samples, n_classes)) – Target scores. These can be either probability estimates or
non-thresholded decision values (as returned by
decision_function on some classifiers).
The binary case expects scores with shape (n_samples,) while the
multiclass case expects scores with shape (n_samples, n_classes).
In the multiclass case, the order of the class scores must
correspond to the order of labels, if provided, or else to
the numerical or lexicographical order of the labels in y_true.
If y_true does not contain all the labels, labels must be
provided.
k (int, default=2) – Number of most likely outcomes considered to find the correct label.
normalize (bool, default=True) – If True, return the fraction of correctly classified samples.
Otherwise, return the number of correctly classified samples.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights. If None, all samples are given the same weight.
labels (array-like of shape (n_classes,), default=None) – Multiclass only. List of labels that index the classes in y_score.
If None, the numerical or lexicographical order of the labels in
y_true is used. If y_true does not contain all the labels,
labels must be provided.
Returns:
score – The top-k accuracy score. The best performance is 1 with
normalize == True and the number of samples with
normalize == False.
See also
accuracy_score
Compute the accuracy score. By default, the function returns the fraction of correctly classified samples.
Notes
In cases where two or more labels are assigned equal predicted scores,
the labels with the highest indices will be chosen first. This might
impact the result if the correct label falls after the threshold because
of that.
Examples
>>> import numpy as np
>>> from sklearn.metrics import top_k_accuracy_score
>>> y_true = np.array([0, 1, 2, 2])
>>> y_score = np.array([[0.5, 0.2, 0.2],  # 0 is in top 2
...                     [0.3, 0.4, 0.2],  # 1 is in top 2
...                     [0.2, 0.4, 0.3],  # 2 is in top 2
...                     [0.7, 0.2, 0.1]]) # 2 isn't in top 2
>>> top_k_accuracy_score(y_true, y_score, k=2)
0.75
>>> # Not normalizing gives the number of "correctly" classified samples
>>> top_k_accuracy_score(y_true, y_score, k=2, normalize=False)
3
Compute Cohen’s kappa: a statistic that measures inter-annotator agreement.
This function computes Cohen’s kappa [1], a score that expresses the level
of agreement between two annotators on a classification problem. It is
defined as
\[\kappa = (p_o - p_e) / (1 - p_e)\]
where \(p_o\) is the empirical probability of agreement on the label
assigned to any sample (the observed agreement ratio), and \(p_e\) is
the expected agreement when both annotators assign labels randomly.
\(p_e\) is estimated using a per-annotator empirical prior over the
class labels [2].
Read more in the User Guide.
Parameters:
y1 (array-like of shape (n_samples,)) – Labels assigned by the first annotator.
y2 (array-like of shape (n_samples,)) – Labels assigned by the second annotator. The kappa statistic is
symmetric, so swapping y1 and y2 doesn’t change the value.
labels (array-like of shape (n_classes,), default=None) – List of labels to index the matrix. This may be used to select a
subset of labels. If None, all labels that appear at least once in
y1 or y2 are used.
weights ({'linear', 'quadratic'}, default=None) – Weighting type to calculate the score. None means unweighted;
“linear” means linear weighting; “quadratic” means quadratic weighting.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
Returns:
kappa – The kappa statistic, which is a number between -1 and 1. The maximum
value means complete agreement; zero or lower means chance agreement.
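For example:
>>> from sklearn.metrics import cohen_kappa_score
>>> y1 = ["negative", "positive", "negative", "neutral", "positive"]
>>> y2 = ["negative", "positive", "negative", "neutral", "negative"]
>>> cohen_kappa_score(y1, y2)
0.6875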
BEDROC metric implemented according to Truchon and Bayly that modifies
the ROC score by allowing for a factor of early recognition.
Please confirm details from [1].
Parameters:
y_true (np.ndarray) – Binary class labels. 1 for positive class, 0 otherwise
y_pred (np.ndarray) – Predicted labels
alpha (float, default 20.0) – Early recognition parameter
Returns:
Value in [0, 1] that indicates the degree of early recognition
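A hedged usage sketch (assuming bedroc_score is importable from deepchem.metrics and accepts a per-sample score array for y_pred; check the signature in your DeepChem version):
>>> import numpy as np
>>> from deepchem.metrics import bedroc_score
>>> y_true = np.array([1, 1, 0, 0, 0, 0, 0, 0])  # 2 actives among 8 compounds
>>> y_score = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.15, 0.1])
>>> score = bedroc_score(y_true, y_score, alpha=20.0)  # near 1.0: strong early recognition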
Parameters:
encoded_sequences (np.ndarray) – A numpy array of shape (N_sequences, N_letters, sequence_length, 1).
motif_names (List[str]) – List of motif file names.
max_scores (int, optional) – Get top max_scores scores.
return_positions (bool, default False) – Whether to return positions or not.
GC_fraction (float, default 0.4) – GC fraction in background sequence.
Returns:
A numpy array of scores. The shape is (N_sequences, num_motifs, seq_length) by default.
If max_scores is set, the shape of the score array is (N_sequences, num_motifs*max_scores).
If both max_scores and return_positions are set, the shape of the score array (max scores
and their positions) is (N_sequences, 2*num_motifs*max_scores).
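Since the helper's exact name is not shown above, here is a self-contained numpy sketch (illustrative only, not the DeepChem API) of how per-position motif scores with the default output shape (N_sequences, num_motifs, seq_length) can be computed:
import numpy as np

def motif_scores(seqs_one_hot, pwms):
    # seqs_one_hot: (N_sequences, n_letters, seq_length, 1) one-hot sequences
    # pwms: list of (n_letters, motif_length) log-odds matrices
    N, n_letters, seq_len, _ = seqs_one_hot.shape
    out = np.zeros((N, len(pwms), seq_len))
    for m, pwm in enumerate(pwms):
        k = pwm.shape[1]
        for i in range(seq_len - k + 1):
            window = seqs_one_hot[:, :, i:i + k, 0]  # (N, n_letters, k)
            out[:, m, i] = np.einsum('nlk,lk->n', window, pwm)
    return out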
The Metric class provides a wrapper for standardizing the API
around different classes of metrics that may be useful for DeepChem
models. The implementation provides a few non-standard conveniences
such as built-in support for multitask and multiclass metrics.
This class aims to support a variety of metrics. Metrics for
classification and regression are supported, provided the values
being compared are scalars.
At present, this class doesn’t support metric computation on models
which don’t present scalar outputs. For example, if you have a
generative model which predicts images or molecules, you will need
to write a custom evaluation and metric setup.
metric (function) – Function that takes args y_true, y_pred (in that order) and
computes desired score. If sample weights are to be considered,
metric may take in an additional keyword argument
sample_weight.
task_averager (function, default None) – If not None, should be a function that averages metrics across
tasks.
name (str, default None) – Name of this metric
threshold (float, default None (DEPRECATED)) – Used for binary metrics and is the threshold for the positive
class.
mode (str, default None) – Should usually be “classification” or “regression.”
n_tasks (int, default None) – The number of tasks this class is expected to handle.
DeepChem models by default predict class probabilities for
classification problems. This means that for a given singletask
prediction, after shape normalization, the DeepChem labels and prediction will be
numpy arrays of shape (n_samples, n_tasks, n_classes) with class probabilities.
classification_handling_mode is a string that instructs this method
how to handle transforming these probabilities. It can take on the
following values:
- “direct”: Pass y_true and y_pred directly into self.metric.
- “threshold”: Use threshold_predictions to threshold y_true and y_pred,
with threshold_value as the desired threshold. This converts them into
arrays of shape (n_samples, n_tasks), where each element is a class index.
- “threshold-one-hot”: Use threshold_predictions to threshold y_true and y_pred
using threshold_value, then apply to_one_hot to the output.
- None: Select a mode automatically based on the metric.
threshold_value (float, default None) – If set, and classification_handling_mode is “threshold” or
“threshold-one-hot”, apply a thresholding operation to values with this
threshold. This option is only sensible on binary classification tasks.
For multiclass problems, or if threshold_value is None, argmax() is used
to select the highest probability class for each task.
y_true (ArrayLike) – An ArrayLike containing true values for each task. For a
classification metric, must be of shape (N,) or (N, n_tasks) or
(N, n_tasks, n_classes). If of shape (N, n_tasks), values can be either
class labels or probabilities of the positive class for binary
classification problems. For a regression metric, must be of shape
(N,) or (N, n_tasks) or (N, n_tasks, 1).
y_pred (ArrayLike) – An ArrayLike containing predicted values for each task. Must be
of shape (N, n_tasks, n_classes) if a classification metric,
else must be of shape (N, n_tasks) if a regression metric.
w (ArrayLike, default None) – An ArrayLike containing weights for each datapoint. If
specified, must be of shape (N, n_tasks).
n_tasks (int, default None) – The number of tasks this class is expected to handle.
n_classes (int, default 2) – Number of classes in data for classification tasks.
per_task_metrics (bool, default False) – If True, return the computed metric for each task on a multitask dataset.
use_sample_weights (bool, default False) – If set, use per-sample weights w.
kwargs (dict) – Will be passed on to self.metric
Returns:
A numpy array containing metric values for each task.
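For example, a typical singletask classification sketch (standard DeepChem usage; the exact formatting of outputs may vary by version):
>>> import numpy as np
>>> import deepchem as dc
>>> metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean, mode="classification")
>>> y_true = np.array([0, 0, 1, 1])
>>> y_pred = np.array([[0.9, 0.1], [0.6, 0.4], [0.35, 0.65], [0.2, 0.8]])
>>> metric.compute_metric(y_true, y_pred, n_tasks=1, n_classes=2)
1.0
The same Metric object can be passed to a model's evaluate call (e.g. model.evaluate(dataset, [metric])) to score a trained model on a dataset.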