Hyperparameter Tuning

Hyperparameter tuning is one of the most important aspects of machine learning. Many machine learning models have a number of hyperparameters that control aspects of the model. These hyperparameters typically cannot be learned by the same algorithm that fits the rest of the model and have to be set some other way. The dc.hyper module contains utilities for hyperparameter tuning.

DeepChem’s hyperparameter optimization algorithms are simple and run single-threaded. They are not intended to be production-grade hyperparameter utilities, but rather useful first tools as you start exploring your parameter space. As the needs of your application grow, we recommend switching to a more heavy-duty hyperparameter optimization library.

Hyperparameter Optimization API

class HyperparamOpt(model_builder: Callable[[...], Model])

Abstract superclass for hyperparameter search classes.

This class is an abstract base class for hyperparameter search classes in DeepChem. Hyperparameter search is performed on dc.models.Model classes. Each hyperparameter search object accepts a constructor function for a dc.models.Model upon construction. When the hyperparam_search method is invoked, this function is used to construct many different concrete models, which are trained on the specified training set and evaluated on a given validation set.

Different subclasses of HyperparamOpt differ in the choice of strategy for searching the hyperparameter evaluation space. This class itself is an abstract superclass and should never be directly instantiated.

__init__(model_builder: Callable[[...], Model])

Initialize Hyperparameter Optimizer.

Note this is an abstract constructor which should only be used by subclasses.

Parameters: model_builder (constructor function) – This parameter must be a constructor function that returns an instance of dc.models.Model. This function must accept two arguments: model_params, of type dict, and model_dir, a string specifying a path to a model directory. See the examples for the subclasses below.

hyperparam_search(params_dict: Dict, train_dataset: Dataset, valid_dataset: Dataset, metric: Metric, output_transformers: List[Transformer] = [], nb_epoch: int = 10, use_max: bool = True, logfile: str = 'results.txt', logdir: str | None = None, **kwargs) → Tuple[Model, Dict[str, Any], Dict[str, Any]]

Conduct Hyperparameter search.

This method defines the common API shared by all hyperparameter optimization subclasses. Different classes will implement different search methods but they must all follow this common API.

Parameters:

params_dict (Dict) – Dictionary mapping strings to values. Note that the precise semantics of params_dict will change depending on the optimizer that you’re using. Depending on the type of hyperparameter optimization, these values can be ints/floats/strings/lists/etc. Read the documentation for the concrete hyperparameter optimization subclass you’re using to learn more about what’s expected.
train_dataset (Dataset) – dataset used for training
valid_dataset (Dataset) – dataset used for validation (hyperparameters are optimized on validation scores)
metric (Metric) – metric used for evaluation
output_transformers (list[Transformer]) – Transformers for evaluation. This argument is needed since train_dataset and valid_dataset may have been transformed for learning and need the transform to be inverted before the metric can be evaluated on a model.
nb_epoch (int, default 10) – Specifies the number of training epochs during each iteration of optimization.
use_max (bool, optional) – If True, return the model with the highest score. Otherwise, return the model with the lowest score.
logdir (str, optional) – The directory in which to store created models. If not set, will use a temporary directory.
logfile (str, optional (default results.txt)) – Name of logfile to write results to. If specified, this must be a valid file name. If not specified, results of hyperparameter search will be written to logdir/results.txt.

Returns:

(best_model, best_hyperparams, all_scores) where best_model is an instance of dc.models.Model, best_hyperparams is a dictionary of parameters, and all_scores is a dictionary mapping string representations of hyperparameter sets to validation scores.

Return type:

Tuple[best_model, best_hyperparams, all_scores]
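The effect of use_max on which result counts as "best" can be illustrated with a small sketch. The scores below are made-up values, the key strings are illustrative only, and pick_best is a hypothetical helper that is not part of dc.hyper. It only mirrors the documented rule: highest score when use_max is True, lowest otherwise.

```python
from typing import Dict, Tuple


def pick_best(all_scores: Dict[str, float], use_max: bool = True) -> Tuple[str, float]:
    """Return the (key, score) pair with the highest or the lowest score."""
    chooser = max if use_max else min
    best_key = chooser(all_scores, key=all_scores.get)
    return best_key, all_scores[best_key]


# Made-up validation scores, keyed by a string describing each hyperparameter set.
example_scores = {
    "penalty_l1_solver_liblinear": 0.70,
    "penalty_l2_solver_liblinear": 0.85,
    "penalty_l2_solver_saga": 0.80,
}

best_when_maximizing = pick_best(example_scores, use_max=True)
best_when_minimizing = pick_best(example_scores, use_max=False)
print(best_when_maximizing)
print(best_when_minimizing)
```

Metrics where larger is better (accuracy, R²) call for use_max=True, while error metrics such as RMSE call for use_max=False.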

Grid Hyperparameter Optimization

This is the simplest form of hyperparameter optimization: it iterates over a fixed grid of possible values for the hyperparameters.

class GridHyperparamOpt(model_builder: Callable[[...], Model])

Provides simple grid hyperparameter search capabilities.

This class performs a grid hyperparameter search over the specified hyperparameter space. The implementation iterates directly over all possible hyperparameter combinations and does not use parallelization to speed up the search.
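What "all possible hyperparameter combinations" means can be shown with a short sketch that enumerates a grid using only the standard library. It uses the same params dictionary as the LogisticRegression example later in this section. It illustrates the enumeration only and is not DeepChem's implementation; no models are built or trained here.

```python
from itertools import product

# Each key is a hyperparameter name; each value is the list of candidates to try.
params = {
    "penalty": ["l1", "l2"],
    "solver": ["liblinear", "saga"],
}

names = list(params)
# Cartesian product of the value lists: one dict per grid point.
grid = [dict(zip(names, values)) for values in product(*params.values())]

for point in grid:
    print(point)
```

The number of grid points is the product of the list lengths (2 × 2 = 4 here), so a single-threaded search gets expensive quickly as more hyperparameters or candidate values are added.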

Examples

This example shows the type of constructor function expected.

>>> import sklearn
>>> import deepchem as dc
>>> optimizer = dc.hyper.GridHyperparamOpt(lambda **p: dc.models.GraphConvModel(**p))

Here’s a more sophisticated example that shows how to optimize only some parameters of a model. In this case, we have some parameters we want to optimize, and others which we don’t. To handle this type of search, we create a model_builder which hard-codes some arguments (in this case, max_iter is a hyperparameter which we don’t want to search over).

>>> import deepchem as dc
>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression as LR
>>> # generating data
>>> X = np.arange(1, 11, 1).reshape(-1, 1)
>>> y = np.hstack((np.zeros(5), np.ones(5)))
>>> dataset = dc.data.NumpyDataset(X, y)
>>> # splitting dataset into train and test
>>> splitter = dc.splits.RandomSplitter()
>>> train_dataset, test_dataset = splitter.train_test_split(dataset)
>>> # metric to evaluate result of a set of parameters
>>> metric = dc.metrics.Metric(dc.metrics.accuracy_score)
>>> # defining `model_builder`
>>> def model_builder(**model_params):
...   penalty = model_params['penalty']
...   solver = model_params['solver']
...   lr = LR(penalty=penalty, solver=solver, max_iter=100)
...   return dc.models.SklearnModel(lr)
>>> # the parameters which are to be optimized
>>> params = {
...   'penalty': ['l1', 'l2'],
...   'solver': ['liblinear', 'saga']
...   }
>>> # Creating optimizer and searching over hyperparameters
>>> optimizer = dc.hyper.GridHyperparamOpt(model_builder)
>>> best_model, best_hyperparams, all_results = optimizer.hyperparam_search(params, train_dataset, test_dataset, metric)
>>> best_hyperparams  # the best hyperparameters
{'penalty': 'l2', 'solver': 'saga'}

hyperparam_search(params_dict: Dict, train_dataset: Dataset, valid_dataset: Dataset, metric: Metric, output_transformers: List[Transformer] = [], nb_epoch: int = 10, use_max: bool = True, logfile: str = 'results.txt', logdir: str | None = None, **kwargs) → Tuple[Model, Dict, Dict]

Perform a hyperparameter search according to params_dict.

Each key of params_dict is the name of a model parameter. Each value should be a list of potential values for that hyperparameter.

Parameters:

params_dict (Dict) – Maps hyperparameter names (strings) to lists of possible parameter values.
train_dataset (Dataset) – dataset used for training
valid_dataset (Dataset) – dataset used for validation (hyperparameters are optimized on validation scores)
metric (Metric) – metric used for evaluation
output_transformers (list[Transformer]) – Transformers for evaluation. This argument is needed since train_dataset and valid_dataset may have been transformed for learning and need the transform to be inverted before the metric can be evaluated on a model.
nb_epoch (int, default 10) – Specifies the number of training epochs during each iteration of optimization. Not used by all model types.
use_max (bool, optional) – If True, return the model with the highest score. Otherwise, return the model with the lowest score.
logdir (str, optional) – The directory in which to store created models. If not set, will use a temporary directory.
logfile (str, optional (default results.txt)) – Name of logfile to write results to. If specified, this must be a valid file name. If not specified, results of hyperparameter search will be written to logdir/results.txt.

Returns:

(best_model, best_hyperparams, all_scores) where best_model is an instance of dc.models.Model, best_hyperparams is a dictionary of parameters, and all_scores is a dictionary mapping string representations of hyperparameter sets to validation scores.

Return type:

Tuple[best_model, best_hyperparams, all_scores]

Notes

From DeepChem 2.6, the return type of best_hyperparams is a dictionary of parameters rather than a tuple of parameters as it was previously. The new changes have been made to standardize the behaviour across different hyperparameter optimization techniques available in DeepChem.
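One practical consequence of the dictionary form is that the result can be unpacked straight back into a builder that takes keyword arguments. The sketch below uses a hypothetical stand-in, toy_model_builder, which returns a plain dict instead of a dc.models.Model. The tuple case assumes the older return value held only the parameter values in a known key order.

```python
def toy_model_builder(**model_params):
    """Hypothetical builder: records the parameters it was given."""
    return {
        "penalty": model_params["penalty"],
        "solver": model_params["solver"],
        "max_iter": 100,  # hard-coded, not part of the search
    }


# Dictionary form: parameter names travel with the values.
best_hyperparams = {"penalty": "l2", "solver": "saga"}
rebuilt = toy_model_builder(**best_hyperparams)

# Tuple form (assumed: values only): the caller has to supply the names,
# in the right order, before the builder can be called.
legacy_best = ("l2", "saga")
rebuilt_from_tuple = toy_model_builder(**dict(zip(["penalty", "solver"], legacy_best)))
```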

Gaussian Process Hyperparameter Optimization

class GaussianProcessHyperparamOpt(model_builder: Callable[[...], Model], max_iter: int = 20, search_range: int | float | Dict = 4)

Gaussian Process Global Optimization (GPGO)

This class uses Gaussian process optimization to select hyperparameters. Under the hood it uses pyGPGO to optimize models. If you don’t have pyGPGO installed, you won’t be able to use this class.

Note that params_dict has different semantics than for GridHyperparamOpt. params_dict[hp] must be an int/float and is used as the center of a search range.

Examples

This example shows the type of constructor function expected.

>>> import deepchem as dc
>>> optimizer = dc.hyper.GaussianProcessHyperparamOpt(lambda **p: dc.models.GraphConvModel(n_tasks=1, **p))

Here’s a more sophisticated example that shows how to optimize only some parameters of a model. In this case, we have some parameters we want to optimize, and others which we don’t. To handle this type of search, we create a model_builder which hard-codes some arguments (for example, n_tasks and n_features, which are properties of a dataset and not hyperparameters to search over).

>>> import numpy as np
>>> from sklearn.ensemble import RandomForestRegressor as RF
>>> def model_builder(**model_params):
...   n_estimators = model_params['n_estimators']
...   min_samples_split = model_params['min_samples_split']
...   rf_model = RF(n_estimators=n_estimators, min_samples_split=min_samples_split)
...   return dc.models.SklearnModel(rf_model)
>>> optimizer = dc.hyper.GaussianProcessHyperparamOpt(model_builder)
>>> params_dict = {"n_estimators":100, "min_samples_split":2}
>>> train_dataset = dc.data.NumpyDataset(X=np.random.rand(50, 5),
...   y=np.random.rand(50, 1))
>>> valid_dataset = dc.data.NumpyDataset(X=np.random.rand(20, 5),
...   y=np.random.rand(20, 1))
>>> metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)

>>> best_model, best_hyperparams, all_results = optimizer.hyperparam_search(params_dict, train_dataset, valid_dataset, metric, max_iter=2)
>>> type(best_hyperparams)
<class 'dict'>

Parameters:

model_builder (constructor function) – This parameter must be a constructor function that returns an instance of dc.models.Model. This function must accept two arguments: model_params, of type dict, and model_dir, a string specifying a path to a model directory.
max_iter (int, default 20) – number of optimization trials
search_range (int/float/Dict (default 4)) –
search_range specifies the range of parameter values to search over. If search_range is an int/float, it is used as the global search range for all parameters. This creates a search problem on the following space:

optimization on [initial value / search_range, initial value * search_range]

If search_range is a dict, it must contain the same keys as params_dict. In this case, search_range specifies a per-parameter search range. This is useful when some parameters have a larger natural range than others. For a given hyperparameter hp, this creates the following search range:

optimization on hp on [initial value[hp] / search_range[hp], initial value[hp] * search_range[hp]]
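The arithmetic behind both forms of search_range can be sketched as follows. search_bounds is a hypothetical helper written for illustration and is not part of dc.hyper. It only computes the interval endpoints from the formulas above and says nothing about how the optimizer samples within them or handles integer-valued parameters.

```python
from typing import Dict, Tuple, Union

Number = Union[int, float]


def search_bounds(
    params_dict: Dict[str, Number],
    search_range: Union[Number, Dict[str, Number]] = 4,
) -> Dict[str, Tuple[float, float]]:
    """Compute (initial / range, initial * range) for every hyperparameter."""
    bounds = {}
    for hp, initial in params_dict.items():
        # A dict gives a per-parameter range; a single number applies to all.
        r = search_range[hp] if isinstance(search_range, dict) else search_range
        bounds[hp] = (initial / r, initial * r)
    return bounds


params_dict = {"n_estimators": 100, "min_samples_split": 2}

# Global range: every parameter is searched on [initial / 4, initial * 4].
global_bounds = search_bounds(params_dict, 4)

# Per-parameter range: a tighter interval for n_estimators only.
per_param_bounds = search_bounds(
    params_dict, {"n_estimators": 2, "min_samples_split": 4}
)
```

With the global range of 4, n_estimators spans 25 to 400 and min_samples_split spans 0.5 to 8. The per-parameter dictionary narrows n_estimators to 50 to 200 while leaving min_samples_split unchanged.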

Notes

This class requires pyGPGO to be installed.

__init__(model_builder: Callable[[...], Model], max_iter: int = 20, search_range: int | float | Dict = 4)

Initialize the hyperparameter optimizer.

Parameters: model_builder (constructor function) – This parameter must be a constructor function that returns an instance of dc.models.Model. This function must accept two arguments: model_params, of type dict, and model_dir, a string specifying a path to a model directory. See the example.

hyperparam_search(params_dict: Dict, train_dataset: Dataset, valid_dataset: Dataset, metric: Metric, output_transformers: List[Transformer] = [], nb_epoch: int = 10, use_max: bool = True, logfile: str = 'results.txt', logdir: str | None = None, **kwargs) → Tuple[Model, Dict[str, Any], Dict[str, Any]]

Perform hyperparameter search using a Gaussian process.

Parameters:

params_dict (Dict) – Maps hyperparameter names (strings) to initial parameter values. The semantics of this dictionary differ from those for GridHyperparamOpt: params_dict[hp] must map to an int/float, which is used as the center of a search range whose extent is set by search_range, since pyGPGO can only optimize numerical hyperparameters.
train_dataset (Dataset) – dataset used for training
valid_dataset (Dataset) – dataset used for validation (hyperparameters are optimized on validation scores)
metric (Metric) – metric used for evaluation
output_transformers (list[Transformer]) – Transformers for evaluation. This argument is needed since train_dataset and valid_dataset may have been transformed for learning and need the transform to be inverted before the metric can be evaluated on a model.
nb_epoch (int, default 10) – Specifies the number of training epochs during each iteration of optimization. Not used by all model types.
use_max (bool, default True) – Specifies whether to maximize (True) or minimize (False) the metric.
logdir (str, optional, default None) – The directory in which to store created models. If not set, will use a temporary directory.
logfile (str, optional (default results.txt)) – Name of logfile to write results to. If specified, this must be a valid file name. If not specified, results of hyperparameter search will be written to logdir/results.txt.

Returns:

(best_model, best_hyperparams, all_scores) where best_model is an instance of dc.models.Model, best_hyperparams is a dictionary of parameters, and all_scores is a dictionary mapping string representations of hyperparameter sets to validation scores.

Return type:

Tuple[best_model, best_hyperparams, all_scores]
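The reason output_transformers appears in every hyperparam_search signature can be seen in a small standard-library sketch. ZScore is a hypothetical stand-in for a normalization transformer, and the "predictions" are made-up numbers rather than model output. The point is only that a metric computed on transformed targets is in different units from one computed after the transform is inverted.

```python
from statistics import mean, pstdev


class ZScore:
    """Hypothetical stand-in for a transformer that normalizes targets."""

    def __init__(self, values):
        self.mu = mean(values)
        self.sigma = pstdev(values)

    def transform(self, values):
        return [(v - self.mu) / self.sigma for v in values]

    def untransform(self, values):
        return [v * self.sigma + self.mu for v in values]


def mean_absolute_error(y_true, y_pred):
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


y_raw = [10.0, 20.0, 30.0, 40.0]
scaler = ZScore(y_raw)
y_scaled = scaler.transform(y_raw)

# Pretend a model trained on scaled targets is off by 0.1 on every sample.
pred_scaled = [v + 0.1 for v in y_scaled]

# Error measured in scaled units versus in the original units of y.
mae_scaled = mean_absolute_error(y_scaled, pred_scaled)
mae_raw = mean_absolute_error(y_raw, scaler.untransform(pred_scaled))
```

Here mae_raw is larger than mae_scaled by a factor of the standard deviation of y_raw, so scores computed without inverting the transform would not be comparable with scores reported in the original units.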