Splitters

DeepChem dc.splits.Splitter objects are tools for meaningfully splitting DeepChem datasets for machine learning. The core idea is that when evaluating a machine learning model, it's useful to create training, validation, and test splits of your source data. The training split is used to train models, the validation split is used to benchmark different model architectures, and the test split is ideally held out until the very end, when it's used to gauge a final estimate of the model's performance.

The dc.splits module contains a collection of scientifically aware splitters. In many cases, we want to evaluate scientific deep learning models more rigorously than standard deep models since we’re looking for the ability to generalize to new domains. Some of the implemented splitters here may help.
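To make the mechanics concrete, the index bookkeeping behind a random 80/10/10 split can be sketched in plain NumPy (a hypothetical helper for illustration, not DeepChem's actual implementation):

```python
import numpy as np

def random_split_indices(n, frac_train=0.8, frac_valid=0.1, seed=None):
    """Shuffle indices 0..n-1, then slice off train/valid/test by fraction."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_train = int(frac_train * n)
    n_valid = int(frac_valid * n)
    return (perm[:n_train],
            perm[n_train:n_train + n_valid],
            perm[n_train + n_valid:])

train, valid, test = random_split_indices(100, seed=0)
```

DeepChem's concrete splitters return either index tuples (split) or Dataset objects (train_valid_test_split) built from index slices along these lines.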

Splitter

The dc.splits.Splitter class is the abstract parent class for all splitters. This class should never be directly instantiated.

class deepchem.splits.Splitter[source]

Splitters split up Datasets into pieces for training/validation/testing.

In machine learning applications, it’s often necessary to split up a dataset into training/validation/test sets. Or to k-fold split a dataset (that is, divide into k equal subsets) for cross-validation. The Splitter class is an abstract superclass for all splitters that captures the common API across splitter classes.

Note that Splitter is an abstract superclass. You won’t want to instantiate this class directly. Rather you will want to use a concrete subclass for your application.

__init__

Initialize self. See help(type(self)) for accurate signature.

k_fold_split(dataset, k, directories=None, **kwargs)[source]
Parameters:
  • dataset (dc.data.Dataset) – Dataset to do a k-fold split
  • k (int) – Number of folds to split dataset into.
  • directories (list[str]) – list of length 2*k filepaths to save the result disk-datasets
Returns:

  • list of length k tuples of (train, cv) where train and cv are both lists of `Dataset`s.
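The fold bookkeeping can be sketched as follows (a plain-NumPy illustration, not the actual method body; the real method additionally materializes each fold as a disk-backed Dataset):

```python
import numpy as np

def k_fold_indices(n, k):
    """Yield (train_inds, cv_inds) for each of k folds over indices 0..n-1."""
    folds = np.array_split(np.arange(n), k)
    for i in range(k):
        cv = folds[i]
        # Training indices for fold i are all the other folds concatenated.
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, cv

splits = list(k_fold_indices(10, k=5))  # 5 (train, cv) pairs
```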

split(dataset, seed=None, frac_train=None, frac_valid=None, frac_test=None, log_every_n=None, **kwargs)[source]

Return indices for specified split

Parameters:
  • dataset (dc.data.Dataset) – Dataset to be split
  • seed (int, optional (default None)) – Random seed to use.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
  • frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
  • frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
  • log_every_n (int, optional) – Controls the logger by dictating how often logger outputs will be produced.
Returns:

  • A tuple (train_inds, valid_inds, test_inds) of the indices (integers) for the various splits.

train_test_split(dataset, train_dir=None, test_dir=None, seed=None, frac_train=0.8, **kwargs)[source]

Splits self into train/test sets.

Returns Dataset objects for train/test.

Parameters:
  • dataset (data like object) – Dataset to be split. This should either be of type dc.data.Dataset or a type that dc.utils.data.datasetify can convert into a Dataset.
  • train_dir (str, optional) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • test_dir (str, optional) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • seed (int, optional (default None)) – Random seed to use.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
Returns:

Return type:

Train and test datasets as dc.data.Dataset objects.

train_valid_test_split(dataset, train_dir=None, valid_dir=None, test_dir=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None, log_every_n=1000, **kwargs)[source]

Splits self into train/validation/test sets.

Returns Dataset objects for train, valid, test.

Parameters:
  • dataset (data like object.) – Dataset to be split. This should either be of type dc.data.Dataset or a type that dc.utils.data.datasetify can convert into a Dataset.
  • train_dir (str, optional) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset)
  • valid_dir (str, optional) – If specified, the directory in which the generated valid dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • test_dir (str, optional) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
  • frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
  • frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
  • seed (int, optional (default None)) – Random seed to use.
  • log_every_n (int, optional) – Controls the logger by dictating how often logger outputs will be produced.
Returns:

Return type:

Train, valid and test datasets as dc.data.Dataset objects.

RandomSplitter

class deepchem.splits.RandomSplitter[source]

Class for doing random data splits.

__init__

Initialize self. See help(type(self)) for accurate signature.

k_fold_split(dataset, k, directories=None, **kwargs)[source]
Parameters:
  • dataset (dc.data.Dataset) – Dataset to do a k-fold split
  • k (int) – Number of folds to split dataset into.
  • directories (list[str]) – list of length 2*k filepaths to save the result disk-datasets
Returns:

  • list of length k tuples of (train, cv) where train and cv are both lists of `Dataset`s.

split(dataset, seed=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, log_every_n=None)[source]

Splits internal compounds randomly into train/validation/test.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to be split
  • seed (int, optional (default None)) – Random seed to use.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
  • frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
  • frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
  • log_every_n (int, optional) – Controls the logger by dictating how often logger outputs will be produced.
Returns:

  • A tuple (train_inds, valid_inds, test_inds) of the indices (integers) for the various splits.

train_test_split(dataset, train_dir=None, test_dir=None, seed=None, frac_train=0.8, **kwargs)[source]

Splits self into train/test sets.

Returns Dataset objects for train/test.

Parameters:
  • dataset (data like object) – Dataset to be split. This should either be of type dc.data.Dataset or a type that dc.utils.data.datasetify can convert into a Dataset.
  • train_dir (str, optional) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • test_dir (str, optional) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • seed (int, optional (default None)) – Random seed to use.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
Returns:

Return type:

Train and test datasets as dc.data.Dataset objects.

train_valid_test_split(dataset, train_dir=None, valid_dir=None, test_dir=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None, log_every_n=1000, **kwargs)[source]

Splits self into train/validation/test sets.

Returns Dataset objects for train, valid, test.

Parameters:
  • dataset (data like object.) – Dataset to be split. This should either be of type dc.data.Dataset or a type that dc.utils.data.datasetify can convert into a Dataset.
  • train_dir (str, optional) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset)
  • valid_dir (str, optional) – If specified, the directory in which the generated valid dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • test_dir (str, optional) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
  • frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
  • frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
  • seed (int, optional (default None)) – Random seed to use.
  • log_every_n (int, optional) – Controls the logger by dictating how often logger outputs will be produced.
Returns:

Return type:

Train, valid and test datasets as dc.data.Dataset objects.

IndexSplitter

class deepchem.splits.IndexSplitter[source]

Class for simple order based splits.

Use this class when the Dataset you have is already ordered as you would like it to be processed. Then the first frac_train proportion is used for training, the next frac_valid for validation, and the final frac_test for testing. This class may make sense to use if your Dataset is already time ordered (for example).
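The ordering-based logic amounts to slicing a contiguous index range (a plain-NumPy sketch, not the class's actual source):

```python
import numpy as np

def index_split(n, frac_train=0.8, frac_valid=0.1):
    """Split indices 0..n-1 in order: the first frac_train for training,
    the next frac_valid for validation, and the remainder for testing."""
    inds = np.arange(n)
    n_train = int(frac_train * n)
    n_valid = int(frac_valid * n)
    return (inds[:n_train],
            inds[n_train:n_train + n_valid],
            inds[n_train + n_valid:])

train, valid, test = index_split(10)
```

No shuffling occurs, so for a time-ordered Dataset the test split is always the most recent data.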

__init__

Initialize self. See help(type(self)) for accurate signature.

k_fold_split(dataset, k, directories=None, **kwargs)[source]
Parameters:
  • dataset (dc.data.Dataset) – Dataset to do a k-fold split
  • k (int) – Number of folds to split dataset into.
  • directories (list[str]) – list of length 2*k filepaths to save the result disk-datasets
Returns:

  • list of length k tuples of (train, cv) where train and cv are both lists of `Dataset`s.

split(dataset, seed=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, log_every_n=None)[source]

Splits internal compounds into train/validation/test in provided order.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to be split
  • seed (int, optional (default None)) – Random seed to use.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
  • frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
  • frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
  • log_every_n (int, optional) – Controls the logger by dictating how often logger outputs will be produced.
Returns:

  • A tuple (train_inds, valid_inds, test_inds) of the indices (integers) for the various splits.

train_test_split(dataset, train_dir=None, test_dir=None, seed=None, frac_train=0.8, **kwargs)[source]

Splits self into train/test sets.

Returns Dataset objects for train/test.

Parameters:
  • dataset (data like object) – Dataset to be split. This should either be of type dc.data.Dataset or a type that dc.utils.data.datasetify can convert into a Dataset.
  • train_dir (str, optional) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • test_dir (str, optional) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • seed (int, optional (default None)) – Random seed to use.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
Returns:

Return type:

Train and test datasets as dc.data.Dataset objects.

train_valid_test_split(dataset, train_dir=None, valid_dir=None, test_dir=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None, log_every_n=1000, **kwargs)[source]

Splits self into train/validation/test sets.

Returns Dataset objects for train, valid, test.

Parameters:
  • dataset (data like object.) – Dataset to be split. This should either be of type dc.data.Dataset or a type that dc.utils.data.datasetify can convert into a Dataset.
  • train_dir (str, optional) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset)
  • valid_dir (str, optional) – If specified, the directory in which the generated valid dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • test_dir (str, optional) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
  • frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
  • frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
  • seed (int, optional (default None)) – Random seed to use.
  • log_every_n (int, optional) – Controls the logger by dictating how often logger outputs will be produced.
Returns:

Return type:

Train, valid and test datasets as dc.data.Dataset objects.

IndiceSplitter

class deepchem.splits.IndiceSplitter(valid_indices=None, test_indices=None)[source]

Split data in the fashion specified by the user.

For some applications, you will already know how you'd like to split the dataset. In this splitter, you simply specify valid_indices and test_indices, and the datapoints at those indices are pulled out of the dataset. Note that this is different from IndexSplitter, which only splits based on the existing dataset ordering, while IndiceSplitter can split on any specified ordering.
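The selection logic can be sketched as follows (a hypothetical helper working on bare indices; the real splitter operates on a dc.data.Dataset):

```python
def indice_split(n, valid_indices, test_indices):
    """Pull the named indices into valid/test; everything else is train."""
    held_out = set(valid_indices) | set(test_indices)
    train = [i for i in range(n) if i not in held_out]
    return train, list(valid_indices), list(test_indices)

train, valid, test = indice_split(10, valid_indices=[2, 5], test_indices=[7])
```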

__init__(valid_indices=None, test_indices=None)[source]
Parameters:
  • valid_indices (list of int) – indices of samples in the valid set
  • test_indices (list of int) – indices of samples in the test set
k_fold_split(dataset, k, directories=None, **kwargs)[source]
Parameters:
  • dataset (dc.data.Dataset) – Dataset to do a k-fold split
  • k (int) – Number of folds to split dataset into.
  • directories (list[str]) – list of length 2*k filepaths to save the result disk-datasets
Returns:

  • list of length k tuples of (train, cv) where train and cv are both lists of `Dataset`s.

split(dataset, seed=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, log_every_n=None)[source]

Splits internal compounds into train/validation/test in designated order.

train_test_split(dataset, train_dir=None, test_dir=None, seed=None, frac_train=0.8, **kwargs)[source]

Splits self into train/test sets.

Returns Dataset objects for train/test.

Parameters:
  • dataset (data like object) – Dataset to be split. This should either be of type dc.data.Dataset or a type that dc.utils.data.datasetify can convert into a Dataset.
  • train_dir (str, optional) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • test_dir (str, optional) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • seed (int, optional (default None)) – Random seed to use.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
Returns:

Return type:

Train and test datasets as dc.data.Dataset objects.

train_valid_test_split(dataset, train_dir=None, valid_dir=None, test_dir=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None, log_every_n=1000, **kwargs)[source]

Splits self into train/validation/test sets.

Returns Dataset objects for train, valid, test.

Parameters:
  • dataset (data like object.) – Dataset to be split. This should either be of type dc.data.Dataset or a type that dc.utils.data.datasetify can convert into a Dataset.
  • train_dir (str, optional) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset)
  • valid_dir (str, optional) – If specified, the directory in which the generated valid dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • test_dir (str, optional) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
  • frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
  • frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
  • seed (int, optional (default None)) – Random seed to use.
  • log_every_n (int, optional) – Controls the logger by dictating how often logger outputs will be produced.
Returns:

Return type:

Train, valid and test datasets as dc.data.Dataset objects.

SpecifiedSplitter

class deepchem.splits.SpecifiedSplitter(input_file, split_field)[source]

Class that splits data according to user specification.

__init__(input_file, split_field)[source]

Provide input information for splits.
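As an illustration of the idea, suppose the input file is a CSV whose split_field column labels each row's partition; grouping row indices by that column yields the split. The column name and label values below are hypothetical, not guaranteed by this class:

```python
import csv
import io

# Hypothetical CSV: the "split" column plays the role of split_field.
csv_text = "smiles,split\nC,train\nCC,train\nCCC,valid\nCCCC,test\n"

def split_from_field(rows, split_field):
    """Group row indices by the value found in split_field."""
    out = {"train": [], "valid": [], "test": []}
    for i, row in enumerate(rows):
        out[row[split_field].strip().lower()].append(i)
    return out["train"], out["valid"], out["test"]

rows = list(csv.DictReader(io.StringIO(csv_text)))
train, valid, test = split_from_field(rows, "split")
```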

k_fold_split(dataset, k, directories=None, **kwargs)[source]
Parameters:
  • dataset (dc.data.Dataset) – Dataset to do a k-fold split
  • k (int) – Number of folds to split dataset into.
  • directories (list[str]) – list of length 2*k filepaths to save the result disk-datasets
Returns:

  • list of length k tuples of (train, cv) where train and cv are both lists of `Dataset`s.

split(dataset, seed=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, log_every_n=1000)[source]

Splits internal compounds into train/validation/test by user-specification.

train_test_split(dataset, train_dir=None, test_dir=None, seed=None, frac_train=0.8, **kwargs)[source]

Splits self into train/test sets.

Returns Dataset objects for train/test.

Parameters:
  • dataset (data like object) – Dataset to be split. This should either be of type dc.data.Dataset or a type that dc.utils.data.datasetify can convert into a Dataset.
  • train_dir (str, optional) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • test_dir (str, optional) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • seed (int, optional (default None)) – Random seed to use.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
Returns:

Return type:

Train and test datasets as dc.data.Dataset objects.

train_valid_test_split(dataset, train_dir=None, valid_dir=None, test_dir=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None, log_every_n=1000, **kwargs)[source]

Splits self into train/validation/test sets.

Returns Dataset objects for train, valid, test.

Parameters:
  • dataset (data like object.) – Dataset to be split. This should either be of type dc.data.Dataset or a type that dc.utils.data.datasetify can convert into a Dataset.
  • train_dir (str, optional) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset)
  • valid_dir (str, optional) – If specified, the directory in which the generated valid dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • test_dir (str, optional) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
  • frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
  • frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
  • seed (int, optional (default None)) – Random seed to use.
  • log_every_n (int, optional) – Controls the logger by dictating how often logger outputs will be produced.
Returns:

Return type:

Train, valid and test datasets as dc.data.Dataset objects.

SpecifiedIndexSplitter

class deepchem.splits.SpecifiedIndexSplitter(train_inds, valid_inds, test_inds)[source]

Class that splits data according to user index specification

__init__(train_inds, valid_inds, test_inds)[source]

Provide input information for splits.

k_fold_split(dataset, k, directories=None, **kwargs)[source]
Parameters:
  • dataset (dc.data.Dataset) – Dataset to do a k-fold split
  • k (int) – Number of folds to split dataset into.
  • directories (list[str]) – list of length 2*k filepaths to save the result disk-datasets
Returns:

  • list of length k tuples of (train, cv) where train and cv are both lists of `Dataset`s.

split(dataset, seed=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, log_every_n=1000)[source]

Splits internal compounds into train/validation/test by user-specification.

train_test_split(dataset, train_dir=None, test_dir=None, seed=None, frac_train=0.8, **kwargs)[source]

Splits self into train/test sets.

Returns Dataset objects for train/test.

Parameters:
  • dataset (data like object) – Dataset to be split. This should either be of type dc.data.Dataset or a type that dc.utils.data.datasetify can convert into a Dataset.
  • train_dir (str, optional) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • test_dir (str, optional) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • seed (int, optional (default None)) – Random seed to use.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
Returns:

Return type:

Train and test datasets as dc.data.Dataset objects.

train_valid_test_split(dataset, train_dir=None, valid_dir=None, test_dir=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None, log_every_n=1000, **kwargs)[source]

Splits self into train/validation/test sets.

Returns Dataset objects for train, valid, test.

Parameters:
  • dataset (data like object.) – Dataset to be split. This should either be of type dc.data.Dataset or a type that dc.utils.data.datasetify can convert into a Dataset.
  • train_dir (str, optional) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset)
  • valid_dir (str, optional) – If specified, the directory in which the generated valid dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • test_dir (str, optional) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
  • frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
  • frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
  • seed (int, optional (default None)) – Random seed to use.
  • log_every_n (int, optional) – Controls the logger by dictating how often logger outputs will be produced.
Returns:

Return type:

Train, valid and test datasets as dc.data.Dataset objects.

RandomGroupSplitter

class deepchem.splits.RandomGroupSplitter(groups, *args, **kwargs)[source]

Random split based on groupings.

A splitter class that splits on groupings. An example use case is when there are multiple conformations of the same molecule that share the same topology. This splitter guarantees that the resulting splits preserve these groupings.

Note that this splitter does nothing fancy (such as dynamic programming) to make the realized split fractions match frac_train, frac_valid, and frac_test as closely as possible; it simply permutes the groups themselves. As such, use it with caution if the number of elements per group varies significantly.

__init__(groups, *args, **kwargs)[source]

Initialize this object.

Parameters:groups (array like list of hashables) –

An auxiliary array indicating the group of each item.

For example:

  g: 3 2 2 0 1 1 2 4 3
  X: 0 1 2 3 4 5 6 7 8

or equivalently:

  g: a b b e q x a a r
  X: 0 1 2 3 4 5 6 7 8
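The group-preserving logic can be sketched as follows (a plain-NumPy illustration, not the class's actual source):

```python
import numpy as np

def group_split(groups, frac_train=0.8, frac_valid=0.1, seed=None):
    """Permute the unique group ids, then assign whole groups to
    train/valid/test by fraction, so no group straddles two splits."""
    rng = np.random.default_rng(seed)
    uniq = rng.permutation(np.unique(groups))
    n_train = int(frac_train * len(uniq))
    n_valid = int(frac_valid * len(uniq))
    buckets = (set(uniq[:n_train]),
               set(uniq[n_train:n_train + n_valid]),
               set(uniq[n_train + n_valid:]))
    # Map each group bucket back to the item indices that belong to it.
    return tuple([i for i, g in enumerate(groups) if g in b] for b in buckets)

train, valid, test = group_split([3, 2, 2, 0, 1, 1, 2, 4, 3], seed=0)
```

Because the fractions apply to groups rather than items, the realized item fractions can drift when group sizes are uneven, which is exactly the caveat noted above.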

k_fold_split(dataset, k, directories=None, **kwargs)[source]
Parameters:
  • dataset (dc.data.Dataset) – Dataset to do a k-fold split
  • k (int) – Number of folds to split dataset into.
  • directories (list[str]) – list of length 2*k filepaths to save the result disk-datasets
Returns:

  • list of length k tuples of (train, cv) where train and cv are both lists of `Dataset`s.

split(dataset, seed=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, log_every_n=None)[source]

Return indices for specified split

Parameters:
  • dataset (dc.data.Dataset) – Dataset to be split
  • seed (int, optional (default None)) – Random seed to use.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
  • frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
  • frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
  • log_every_n (int, optional) – Controls the logger by dictating how often logger outputs will be produced.
Returns:

  • A tuple (train_inds, valid_inds, test_inds) of the indices (integers) for the various splits.

train_test_split(dataset, train_dir=None, test_dir=None, seed=None, frac_train=0.8, **kwargs)[source]

Splits self into train/test sets.

Returns Dataset objects for train/test.

Parameters:
  • dataset (data like object) – Dataset to be split. This should either be of type dc.data.Dataset or a type that dc.utils.data.datasetify can convert into a Dataset.
  • train_dir (str, optional) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • test_dir (str, optional) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • seed (int, optional (default None)) – Random seed to use.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
Returns:

Return type:

Train and test datasets as dc.data.Dataset objects.

train_valid_test_split(dataset, train_dir=None, valid_dir=None, test_dir=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None, log_every_n=1000, **kwargs)[source]

Splits self into train/validation/test sets.

Returns Dataset objects for train, valid, test.

Parameters:
  • dataset (data like object.) – Dataset to be split. This should either be of type dc.data.Dataset or a type that dc.utils.data.datasetify can convert into a Dataset.
  • train_dir (str, optional) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset)
  • valid_dir (str, optional) – If specified, the directory in which the generated valid dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • test_dir (str, optional) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
  • frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
  • frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
  • seed (int, optional (default None)) – Random seed to use.
  • log_every_n (int, optional) – Controls the logger by dictating how often logger outputs will be produced.
Returns:

Return type:

Train, valid and test datasets as dc.data.Dataset objects.

RandomStratifiedSplitter

class deepchem.splits.RandomStratifiedSplitter[source]

RandomStratified Splitter class.

For sparse multitask datasets, a standard split offers no guarantees that the splits will have any active compounds. This class guarantees that each task will have a proportional split of the actives in each split. To do this, a ragged split is performed, with different numbers of compounds taken from each task. Thus, the length of the split arrays may exceed the length of the original array. That said, no datapoint is copied to more than one split, so correctness is still ensured.

Note that this splitter is only valid for boolean label data.
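The core idea for a single boolean task can be sketched in plain NumPy (this is not DeepChem's actual ragged multitask implementation, just an illustration of stratification):

```python
import numpy as np

def stratified_split_single_task(y, frac_train=0.8, seed=None):
    """Split actives and inactives separately so each side of the split
    keeps (approximately) the same active fraction as the full dataset."""
    rng = np.random.default_rng(seed)
    train, test = [], []
    for label in (0, 1):
        inds = rng.permutation(np.where(y == label)[0])
        cut = int(frac_train * len(inds))
        train.extend(inds[:cut].tolist())
        test.extend(inds[cut:].tolist())
    return sorted(train), sorted(test)

y = np.array([1, 0, 0, 0, 1, 0, 0, 0, 0, 1] * 10)  # 100 labels, 30% active
train, test = stratified_split_single_task(y, seed=0)
```

The multitask version repeats this bookkeeping per task, which is what produces the ragged splits described above.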

TODO(rbharath): This splitter should be refactored to match style of other splitter classes.

__init__

Initialize self. See help(type(self)) for accurate signature.

get_task_split_indices(y, w, frac_split)[source]

Returns num datapoints needed per task to split properly.

k_fold_split(dataset, k, directories=None, **kwargs)[source]

Needs custom implementation due to ragged splits for stratification.

split(dataset, frac_split, split_dirs=None)[source]

Method that does bulk of splitting dataset.

train_test_split(dataset, train_dir=None, test_dir=None, seed=None, frac_train=0.8, **kwargs)[source]

Splits self into train/test sets.

Returns Dataset objects for train/test.

Parameters:
  • dataset (data like object) – Dataset to be split. This should either be of type dc.data.Dataset or a type that dc.utils.data.datasetify can convert into a Dataset.
  • train_dir (str, optional) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • test_dir (str, optional) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • seed (int, optional (default None)) – Random seed to use.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
Returns:

Return type:

Train and test datasets as dc.data.Dataset objects.

train_valid_test_split(dataset, train_dir=None, valid_dir=None, test_dir=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None, log_every_n=1000)[source]

Splits self into train/validation/test sets.

Most splitters use the superclass implementation Splitter.train_valid_test_split but this class has to override the implementation to deal with potentially ragged splits.

Parameters:
  • dataset (data like object.) – Dataset to be split. This should either be of type dc.data.Dataset or a type that dc.utils.data.datasetify can convert into a Dataset.
  • train_dir (str, optional) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset)
  • valid_dir (str, optional) – If specified, the directory in which the generated valid dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • test_dir (str, optional) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
  • frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
  • frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
  • seed (int, optional (default None)) – Random seed to use.
  • log_every_n (int, optional) – Controls the logger by dictating how often logger outputs will be produced.
Returns:

Return type:

Train, validation, and test datasets as dc.data.Dataset objects.
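Conceptually, the fraction parameters simply partition a shuffled index array. The following self-contained NumPy sketch (using a hypothetical `train_valid_test_indices` helper, not part of the DeepChem API) illustrates how `frac_train`, `frac_valid`, `frac_test`, and `seed` interact:

```python
import numpy as np

def train_valid_test_indices(n_samples, frac_train=0.8, frac_valid=0.1,
                             frac_test=0.1, seed=None):
    """Shuffle indices and cut the permutation at the fraction
    boundaries (hypothetical helper, not the DeepChem API)."""
    assert abs(frac_train + frac_valid + frac_test - 1.0) < 1e-8
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    train_cut = int(frac_train * n_samples)
    valid_cut = int((frac_train + frac_valid) * n_samples)
    return (indices[:train_cut], indices[train_cut:valid_cut],
            indices[valid_cut:])

train, valid, test = train_valid_test_indices(100, seed=0)
```

Passing the same `seed` reproduces the same partition, which is why seeding matters when comparing model architectures on a fixed split.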

SingletaskStratifiedSplitter

class deepchem.splits.SingletaskStratifiedSplitter(task_number=0)[source]

Class for doing data splits by stratification on a single task.

Example

>>> import numpy as np
>>> from deepchem.data import DiskDataset
>>> from deepchem.splits import SingletaskStratifiedSplitter
>>> n_samples = 100
>>> n_features = 10
>>> n_tasks = 10
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.random.rand(n_samples, n_tasks)
>>> w = np.ones_like(y)
>>> dataset = DiskDataset.from_numpy(X, y, w)
>>> splitter = SingletaskStratifiedSplitter(task_number=5)
>>> train_dataset, test_dataset = splitter.train_test_split(dataset)
__init__(task_number=0)[source]

Creates splitter object.

Parameters:task_number (int (Optional, Default 0)) – Task number for stratification.
k_fold_split(dataset, k, directories=None, seed=None, log_every_n=None, **kwargs)[source]

Splits compounds into k-folds using stratified sampling. This overrides the base class k_fold_split.

Parameters:
  • dataset (dc.data.Dataset object) – Dataset.
  • k (int) – Number of folds.
  • seed (int (Optional, Default None)) – Random seed.
  • log_every_n (int (Optional, Default None)) – Log every n examples (not currently used).
Returns:

fold_datasets – List containing dc.data.Dataset objects

Return type:

List

split(dataset, seed=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, log_every_n=None)[source]

Splits compounds into train/validation/test using stratified sampling.

Parameters:
  • dataset (dc.data.Dataset object) – Dataset.
  • seed (int (Optional, Default None)) – Random seed.
  • frac_train (float (Optional, Default .8)) – Fraction of dataset put into training data.
  • frac_valid (float (Optional, Default .1)) – Fraction of dataset put into validation data.
  • frac_test (float (Optional, Default .1)) – Fraction of dataset put into test data.
  • log_every_n (int (Optional, Default None)) – Log every n examples (not currently used).
Returns:

retval – Tuple containing train indices, valid indices, and test indices

Return type:

Tuple
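One way to picture single-task stratification: sort samples by the chosen task's values and deal them out so every split spans the full value range rather than one end of it. This standalone sketch (an illustrative `stratified_split_indices` helper, not DeepChem's exact implementation) demonstrates the idea:

```python
import numpy as np

def stratified_split_indices(y, task_number=0, frac_train=0.8,
                             frac_valid=0.1, frac_test=0.1):
    """Deal indices, sorted by the chosen task's values, to whichever
    split is currently furthest below its target fraction, so each
    split covers the whole range of task values."""
    order = np.argsort(y[:, task_number])
    buckets = {"train": [], "valid": [], "test": []}
    targets = {"train": frac_train, "valid": frac_valid, "test": frac_test}
    for i, idx in enumerate(order, start=1):
        deficits = {k: targets[k] - len(buckets[k]) / i for k in targets}
        buckets[max(deficits, key=deficits.get)].append(int(idx))
    return buckets["train"], buckets["valid"], buckets["test"]

y = np.random.rand(100, 10)
train, valid, test = stratified_split_indices(y, task_number=5)
```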

train_test_split(dataset, train_dir=None, test_dir=None, seed=None, frac_train=0.8, **kwargs)[source]

Splits self into train/test sets.

Returns Dataset objects for train/test.

Parameters:
  • dataset (data like object) – Dataset to be split. This should either be of type dc.data.Dataset or a type that dc.utils.data.datasetify can convert into a Dataset.
  • train_dir (str, optional) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • test_dir (str, optional) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • seed (int, optional (default None)) – Random seed to use.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
Returns:

Return type:

Train and test datasets as dc.data.Dataset objects.

train_valid_test_split(dataset, train_dir=None, valid_dir=None, test_dir=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None, log_every_n=1000, **kwargs)[source]

Splits self into train/validation/test sets.

Returns Dataset objects for train, valid, test.

Parameters:
  • dataset (data like object.) – Dataset to be split. This should either be of type dc.data.Dataset or a type that dc.utils.data.datasetify can convert into a Dataset.
  • train_dir (str, optional) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • valid_dir (str, optional) – If specified, the directory in which the generated valid dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • test_dir (str, optional) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
  • frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
  • frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
  • seed (int, optional (default None)) – Random seed to use.
  • log_every_n (int, optional) – Controls the logger by dictating how often logger outputs will be produced.
Returns:

Return type:

Train, validation, and test datasets as dc.data.Dataset objects.

MolecularWeightSplitter

class deepchem.splits.MolecularWeightSplitter[source]

Class for doing data splits by molecular weight.

Note

This class requires rdkit to be installed.

__init__

Initialize self. See help(type(self)) for accurate signature.

k_fold_split(dataset, k, directories=None, **kwargs)[source]
Parameters:
  • dataset (dc.data.Dataset) – Dataset to do a k-fold split
  • k (int) – Number of folds to split dataset into.
  • directories (list[str]) – list of length 2*k filepaths to save the result disk-datasets
Returns:

  • list of length k tuples of (train, cv) where train and cv are both
  • lists of `Dataset`s.

split(dataset, seed=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, log_every_n=None)[source]

Splits on molecular weight.

Splits internal compounds into train/validation/test using the molecular weight (MW) calculated from each compound's SMILES string.

Parameters:
  • dataset (dc.data.Dataset) – Dataset to be split
  • seed (int, optional (default None)) – Random seed to use.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
  • frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
  • frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
  • log_every_n (int, optional) – Controls the logger by dictating how often logger outputs will be produced.
Returns:

  • A tuple (train_inds, valid_inds, test_inds) of the indices (integers) for the various splits.
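The effect of this split can be pictured with precomputed weights. In the hedged sketch below, `mol_weights` is a made-up stand-in for the molecular weights RDKit would compute from each SMILES string, and `mw_split_indices` is a hypothetical helper (whether train holds the lightest or heaviest compounds is an implementation detail; here the lightest go to train):

```python
import numpy as np

# Made-up molecular weights (Da), standing in for RDKit-computed values.
mol_weights = np.array([180.2, 46.1, 342.3, 122.1, 58.1,
                        194.2, 500.6, 78.1, 256.4, 410.9])

def mw_split_indices(weights, frac_train=0.8, frac_valid=0.1, frac_test=0.1):
    """Sort compounds by molecular weight and cut the ordering at the
    fraction boundaries, so each split occupies a distinct weight band."""
    order = np.argsort(weights)
    n = len(order)
    train_cut = int(frac_train * n)
    valid_cut = int((frac_train + frac_valid) * n)
    return order[:train_cut], order[train_cut:valid_cut], order[valid_cut:]

train, valid, test = mw_split_indices(mol_weights)
```

Because the splits occupy disjoint weight bands, this splitter tests whether a model trained on light compounds generalizes to heavier ones.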

train_test_split(dataset, train_dir=None, test_dir=None, seed=None, frac_train=0.8, **kwargs)[source]

Splits self into train/test sets.

Returns Dataset objects for train/test.

Parameters:
  • dataset (data like object) – Dataset to be split. This should either be of type dc.data.Dataset or a type that dc.utils.data.datasetify can convert into a Dataset.
  • train_dir (str, optional) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • test_dir (str, optional) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • seed (int, optional (default None)) – Random seed to use.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
Returns:

Return type:

Train and test datasets as dc.data.Dataset objects.

train_valid_test_split(dataset, train_dir=None, valid_dir=None, test_dir=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None, log_every_n=1000, **kwargs)[source]

Splits self into train/validation/test sets.

Returns Dataset objects for train, valid, test.

Parameters:
  • dataset (data like object.) – Dataset to be split. This should either be of type dc.data.Dataset or a type that dc.utils.data.datasetify can convert into a Dataset.
  • train_dir (str, optional) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • valid_dir (str, optional) – If specified, the directory in which the generated valid dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • test_dir (str, optional) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
  • frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
  • frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
  • seed (int, optional (default None)) – Random seed to use.
  • log_every_n (int, optional) – Controls the logger by dictating how often logger outputs will be produced.
Returns:

Return type:

Train, validation, and test datasets as dc.data.Dataset objects.

MaxMinSplitter

class deepchem.splits.MaxMinSplitter[source]

Chemical diversity splitter.

Class for doing splits based on the MaxMin diversity algorithm. Intuitively, the test set comprises the most diverse compounds in the entire dataset, and the validation set comprises the most diverse compounds among those remaining after the test set is removed.

Note

This class requires rdkit to be installed.

__init__

Initialize self. See help(type(self)) for accurate signature.

k_fold_split(dataset, k, directories=None, **kwargs)[source]
Parameters:
  • dataset (dc.data.Dataset) – Dataset to do a k-fold split
  • k (int) – Number of folds to split dataset into.
  • directories (list[str]) – list of length 2*k filepaths to save the result disk-datasets
Returns:

  • list of length k tuples of (train, cv) where train and cv are both
  • lists of `Dataset`s.

split(dataset, seed=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, log_every_n=None)[source]

Splits internal compounds into train/validation/test using the MaxMin diversity algorithm.
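The MaxMin algorithm itself is easy to sketch: grow a picked set by repeatedly adding the compound whose minimum distance to the already-picked compounds is largest. The sketch below (a hypothetical `maxmin_pick` helper) runs on an arbitrary precomputed distance matrix; the real splitter uses Tanimoto distances between molecular fingerprints computed with RDKit:

```python
import numpy as np

def maxmin_pick(dist, n_pick, seed=0):
    """Greedy MaxMin picking on a precomputed distance matrix: seed with
    a random compound, then repeatedly add the compound whose minimum
    distance to the picked set is largest."""
    rng = np.random.default_rng(seed)
    picked = [int(rng.integers(len(dist)))]
    while len(picked) < n_pick:
        min_d = dist[:, picked].min(axis=1)  # distance to nearest pick
        min_d[picked] = -1.0                 # never re-pick a compound
        picked.append(int(min_d.argmax()))
    return picked

# Points on a line form three obvious clusters; MaxMin picks one
# representative far from each previous pick.
pts = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 10.0])
dist = np.abs(pts[:, None] - pts[None, :])
picked = maxmin_pick(dist, 3)
```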

train_test_split(dataset, train_dir=None, test_dir=None, seed=None, frac_train=0.8, **kwargs)[source]

Splits self into train/test sets.

Returns Dataset objects for train/test.

Parameters:
  • dataset (data like object) – Dataset to be split. This should either be of type dc.data.Dataset or a type that dc.utils.data.datasetify can convert into a Dataset.
  • train_dir (str, optional) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • test_dir (str, optional) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • seed (int, optional (default None)) – Random seed to use.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
Returns:

Return type:

Train and test datasets as dc.data.Dataset objects.

train_valid_test_split(dataset, train_dir=None, valid_dir=None, test_dir=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None, log_every_n=1000, **kwargs)[source]

Splits self into train/validation/test sets.

Returns Dataset objects for train, valid, test.

Parameters:
  • dataset (data like object.) – Dataset to be split. This should either be of type dc.data.Dataset or a type that dc.utils.data.datasetify can convert into a Dataset.
  • train_dir (str, optional) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • valid_dir (str, optional) – If specified, the directory in which the generated valid dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • test_dir (str, optional) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
  • frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
  • frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
  • seed (int, optional (default None)) – Random seed to use.
  • log_every_n (int, optional) – Controls the logger by dictating how often logger outputs will be produced.
Returns:

Return type:

Train, validation, and test datasets as dc.data.Dataset objects.

ButinaSplitter

class deepchem.splits.ButinaSplitter[source]

Class for doing data splits based on the Butina clustering of a bulk Tanimoto fingerprint matrix.

__init__

Initialize self. See help(type(self)) for accurate signature.

k_fold_split(dataset, k, directories=None, **kwargs)[source]
Parameters:
  • dataset (dc.data.Dataset) – Dataset to do a k-fold split
  • k (int) – Number of folds to split dataset into.
  • directories (list[str]) – list of length 2*k filepaths to save the result disk-datasets
Returns:

  • list of length k tuples of (train, cv) where train and cv are both
  • lists of `Dataset`s.

split(dataset, seed=None, frac_train=None, frac_valid=None, frac_test=None, log_every_n=1000, cutoff=0.18)[source]

Splits internal compounds into train and validation sets based on the Butina clustering algorithm. This splitting algorithm has an O(N^2) run time, where N is the number of elements in the dataset. The dataset is expected to be a classification dataset.

This algorithm is designed to generate validation data that are novel chemotypes.

Note that this function entirely disregards the ratios for frac_train, frac_valid, and frac_test. Furthermore, it does not generate a test set, only a train and valid set.

Setting a small cutoff value will generate smaller, finer clusters of high similarity, whereas setting a large cutoff value will generate larger, coarser clusters of low similarity.
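The clustering step can be sketched without RDKit by working from a precomputed distance matrix. This hedged sketch of the Butina procedure (a hypothetical `butina_clusters` helper, not DeepChem's implementation) shows how the cutoff controls cluster granularity: compounds within `cutoff` of a centroid join its cluster.

```python
import numpy as np

def butina_clusters(dist, cutoff):
    """Sketch of Butina clustering: repeatedly take the unassigned
    compound with the most unassigned neighbours within `cutoff` as a
    centroid, and assign it plus those neighbours to a new cluster."""
    n = len(dist)
    neighbours = [set(np.flatnonzero(dist[i] < cutoff)) for i in range(n)]
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        centroid = max(unassigned,
                       key=lambda i: len(neighbours[i] & unassigned))
        members = (neighbours[centroid] & unassigned) | {centroid}
        clusters.append(sorted(members))
        unassigned -= members
    return clusters

pts = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 10.0])
dist = np.abs(pts[:, None] - pts[None, :])
clusters = butina_clusters(dist, cutoff=0.5)   # small cutoff: fine clusters
loose = butina_clusters(dist, cutoff=20.0)     # large cutoff: one coarse cluster
```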

train_test_split(dataset, train_dir=None, test_dir=None, seed=None, frac_train=0.8, **kwargs)[source]

Splits self into train/test sets.

Returns Dataset objects for train/test.

Parameters:
  • dataset (data like object) – Dataset to be split. This should either be of type dc.data.Dataset or a type that dc.utils.data.datasetify can convert into a Dataset.
  • train_dir (str, optional) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • test_dir (str, optional) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • seed (int, optional (default None)) – Random seed to use.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
Returns:

Return type:

Train and test datasets as dc.data.Dataset objects.

train_valid_test_split(dataset, train_dir=None, valid_dir=None, test_dir=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None, log_every_n=1000, **kwargs)[source]

Splits self into train/validation/test sets.

Returns Dataset objects for train, valid, test.

Parameters:
  • dataset (data like object.) – Dataset to be split. This should either be of type dc.data.Dataset or a type that dc.utils.data.datasetify can convert into a Dataset.
  • train_dir (str, optional) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • valid_dir (str, optional) – If specified, the directory in which the generated valid dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • test_dir (str, optional) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
  • frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
  • frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
  • seed (int, optional (default None)) – Random seed to use.
  • log_every_n (int, optional) – Controls the logger by dictating how often logger outputs will be produced.
Returns:

Return type:

Train, validation, and test datasets as dc.data.Dataset objects.

ScaffoldSplitter

class deepchem.splits.ScaffoldSplitter[source]

Class for doing data splits based on the scaffold of small molecules.

__init__

Initialize self. See help(type(self)) for accurate signature.

generate_scaffolds(dataset, log_every_n=1000)[source]

Returns all scaffolds from the dataset.
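The grouping this produces can be sketched without RDKit. In the sketch below the scaffold strings are made up, and `scaffold_sets` is a hypothetical helper; the real method derives Bemis-Murcko scaffolds from each compound's SMILES string:

```python
from collections import defaultdict

# Made-up scaffold SMILES for six compounds (the real method computes
# Bemis-Murcko scaffolds with RDKit).
scaffolds = ["c1ccccc1", "c1ccccc1", "C1CCCCC1",
             "c1ccncc1", "c1ccccc1", "C1CCCCC1"]

def scaffold_sets(scaffolds):
    """Group compound indices by scaffold, largest scaffold first, so
    whole scaffolds can be dealt to train/valid/test without leaking
    any single scaffold across splits."""
    groups = defaultdict(list)
    for idx, s in enumerate(scaffolds):
        groups[s].append(idx)
    return sorted(groups.values(), key=len, reverse=True)

sets = scaffold_sets(scaffolds)
```

Keeping each scaffold set intact is what makes the resulting split a test of generalization to unseen chemotypes.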

k_fold_split(dataset, k, directories=None, **kwargs)[source]
Parameters:
  • dataset (dc.data.Dataset) – Dataset to do a k-fold split
  • k (int) – Number of folds to split dataset into.
  • directories (list[str]) – list of length 2*k filepaths to save the result disk-datasets
Returns:

  • list of length k tuples of (train, cv) where train and cv are both
  • lists of `Dataset`s.

split(dataset, seed=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, log_every_n=1000)[source]

Splits internal compounds into train/validation/test by scaffold.

train_test_split(dataset, train_dir=None, test_dir=None, seed=None, frac_train=0.8, **kwargs)[source]

Splits self into train/test sets.

Returns Dataset objects for train/test.

Parameters:
  • dataset (data like object) – Dataset to be split. This should either be of type dc.data.Dataset or a type that dc.utils.data.datasetify can convert into a Dataset.
  • train_dir (str, optional) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • test_dir (str, optional) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • seed (int, optional (default None)) – Random seed to use.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
Returns:

Return type:

Train and test datasets as dc.data.Dataset objects.

train_valid_test_split(dataset, train_dir=None, valid_dir=None, test_dir=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None, log_every_n=1000, **kwargs)[source]

Splits self into train/validation/test sets.

Returns Dataset objects for train, valid, test.

Parameters:
  • dataset (data like object.) – Dataset to be split. This should either be of type dc.data.Dataset or a type that dc.utils.data.datasetify can convert into a Dataset.
  • train_dir (str, optional) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • valid_dir (str, optional) – If specified, the directory in which the generated valid dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • test_dir (str, optional) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
  • frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
  • frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
  • seed (int, optional (default None)) – Random seed to use.
  • log_every_n (int, optional) – Controls the logger by dictating how often logger outputs will be produced.
Returns:

Return type:

Train, validation, and test datasets as dc.data.Dataset objects.

FingerprintSplitter

class deepchem.splits.FingerprintSplitter[source]

Class for doing data splits based on the fingerprints of small molecules. This is an O(N^2) algorithm.

__init__

Initialize self. See help(type(self)) for accurate signature.

k_fold_split(dataset, k, directories=None, **kwargs)[source]
Parameters:
  • dataset (dc.data.Dataset) – Dataset to do a k-fold split
  • k (int) – Number of folds to split dataset into.
  • directories (list[str]) – list of length 2*k filepaths to save the result disk-datasets
Returns:

  • list of length k tuples of (train, cv) where train and cv are both
  • lists of `Dataset`s.

split(dataset, seed=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, log_every_n=1000)[source]

Splits internal compounds into train/validation/test by fingerprint.

train_test_split(dataset, train_dir=None, test_dir=None, seed=None, frac_train=0.8, **kwargs)[source]

Splits self into train/test sets.

Returns Dataset objects for train/test.

Parameters:
  • dataset (data like object) – Dataset to be split. This should either be of type dc.data.Dataset or a type that dc.utils.data.datasetify can convert into a Dataset.
  • train_dir (str, optional) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • test_dir (str, optional) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • seed (int, optional (default None)) – Random seed to use.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
Returns:

Return type:

Train and test datasets as dc.data.Dataset objects.

train_valid_test_split(dataset, train_dir=None, valid_dir=None, test_dir=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None, log_every_n=1000, **kwargs)[source]

Splits self into train/validation/test sets.

Returns Dataset objects for train, valid, test.

Parameters:
  • dataset (data like object.) – Dataset to be split. This should either be of type dc.data.Dataset or a type that dc.utils.data.datasetify can convert into a Dataset.
  • train_dir (str, optional) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • valid_dir (str, optional) – If specified, the directory in which the generated valid dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • test_dir (str, optional) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
  • frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
  • frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
  • seed (int, optional (default None)) – Random seed to use.
  • log_every_n (int, optional) – Controls the logger by dictating how often logger outputs will be produced.
Returns:

Return type:

Train, validation, and test datasets as dc.data.Dataset objects.

TimeSplitterPDBbind

class deepchem.splits.TimeSplitterPDBbind(ids, year_file=None)[source]
__init__(ids, year_file=None)[source]

Initialize self. See help(type(self)) for accurate signature.

k_fold_split(dataset, k, directories=None, **kwargs)[source]
Parameters:
  • dataset (dc.data.Dataset) – Dataset to do a k-fold split
  • k (int) – Number of folds to split dataset into.
  • directories (list[str]) – list of length 2*k filepaths to save the result disk-datasets
Returns:

  • list of length k tuples of (train, cv) where train and cv are both
  • lists of `Dataset`s.

split(dataset, seed=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, log_every_n=None)[source]

Splits protein-ligand pairs in PDBbind into train/validation/test in time order.
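A time-ordered split reduces to sorting indices chronologically and cutting at the fraction boundaries, so the test set holds the most recent complexes. In this sketch the years are made up for illustration and `time_split_indices` is a hypothetical helper (the real splitter reads deposition years from year_file):

```python
import numpy as np

# Made-up deposition years for ten PDBbind complexes.
years = np.array([2004, 1998, 2012, 2007, 2001,
                  1995, 2010, 2003, 2008, 2015])

def time_split_indices(years, frac_train=0.8, frac_valid=0.1, frac_test=0.1):
    """Order complexes chronologically, then cut at the fraction
    boundaries: oldest structures train, newest structures test."""
    order = np.argsort(years)
    n = len(order)
    train_cut = int(frac_train * n)
    valid_cut = int((frac_train + frac_valid) * n)
    return order[:train_cut], order[train_cut:valid_cut], order[valid_cut:]

train, valid, test = time_split_indices(years)
```

Evaluating on the newest structures mimics the prospective setting where a model trained on past data must score complexes deposited later.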

train_test_split(dataset, train_dir=None, test_dir=None, seed=None, frac_train=0.8, **kwargs)[source]

Splits self into train/test sets.

Returns Dataset objects for train/test.

Parameters:
  • dataset (data like object) – Dataset to be split. This should either be of type dc.data.Dataset or a type that dc.utils.data.datasetify can convert into a Dataset.
  • train_dir (str, optional) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • test_dir (str, optional) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • seed (int, optional (default None)) – Random seed to use.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
Returns:

Return type:

Train and test datasets as dc.data.Dataset objects.

train_valid_test_split(dataset, train_dir=None, valid_dir=None, test_dir=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=None, log_every_n=1000, **kwargs)[source]

Splits self into train/validation/test sets.

Returns Dataset objects for train, valid, test.

Parameters:
  • dataset (data like object.) – Dataset to be split. This should either be of type dc.data.Dataset or a type that dc.utils.data.datasetify can convert into a Dataset.
  • train_dir (str, optional) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • valid_dir (str, optional) – If specified, the directory in which the generated valid dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • test_dir (str, optional) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
  • frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
  • frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
  • frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
  • seed (int, optional (default None)) – Random seed to use.
  • log_every_n (int, optional) – Controls the logger by dictating how often logger outputs will be produced.
Returns:

Return type:

Train, validation, and test datasets as dc.data.Dataset objects.