Splitters¶
DeepChem dc.splits.Splitter objects are a tool to meaningfully
split DeepChem datasets for machine learning testing. The core idea is
that when evaluating a machine learning model, it’s useful to create
training, validation and test splits of your source data. The training
split is used to train models, and the validation split is used to
benchmark different model architectures. The test split is ideally held
out until the very end, when it’s used to produce a final estimate of
the model’s performance.
The dc.splits module contains a collection of scientifically
aware splitters. In many cases, we want to evaluate scientific deep
learning models more rigorously than standard deep models since we’re
looking for the ability to generalize to new domains. Some of the
splitters implemented here may help.
General Splitters¶
RandomSplitter¶
- class RandomSplitter[source]¶
Class for doing random data splits.
Examples
>>> import numpy as np
>>> import deepchem as dc
>>> # Creating a dummy NumPy dataset
>>> X, y = np.random.randn(5), np.random.randn(5)
>>> dataset = dc.data.NumpyDataset(X, y)
>>> # Creating a RandomSplitter object
>>> splitter = dc.splits.RandomSplitter()
>>> # Splitting dataset into train and test datasets
>>> train_dataset, test_dataset = splitter.train_test_split(dataset)
- split(dataset: deepchem.data.datasets.Dataset, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: Optional[int] = None, log_every_n: Optional[int] = None) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Splits internal compounds randomly into train/validation/test.
- Parameters
dataset (Dataset) – Dataset to be split.
seed (int, optional (default None)) – Random seed to use.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
log_every_n (int, optional (default None)) – Log every n examples (not currently used).
- Returns
A tuple of train indices, valid indices, and test indices. Each element is a NumPy array of indices.
- Return type
Tuple[np.ndarray, np.ndarray, np.ndarray]
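To make the returned index triple concrete, here is a minimal standard-library sketch of a random fractional split. It is an illustration of the idea only, not DeepChem's implementation; the helper name random_split_indices and the truncating cutoff arithmetic are assumptions made for this example.

```python
import random
from typing import List, Optional, Tuple


def random_split_indices(
    n: int,
    frac_train: float = 0.8,
    frac_valid: float = 0.1,
    frac_test: float = 0.1,
    seed: Optional[int] = None,
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle range(n) and cut it into train/valid/test index lists.

    Hypothetical helper for illustration; not part of DeepChem.
    """
    if abs(frac_train + frac_valid + frac_test - 1.0) > 1e-8:
        raise ValueError("frac_train + frac_valid + frac_test must equal 1")
    rng = random.Random(seed)  # local generator, global random state untouched
    indices = list(range(n))
    rng.shuffle(indices)
    # Cutoffs are truncated, so any rounding remainder ends up in the test split.
    train_cutoff = int(frac_train * n)
    valid_cutoff = int((frac_train + frac_valid) * n)
    return (
        indices[:train_cutoff],
        indices[train_cutoff:valid_cutoff],
        indices[valid_cutoff:],
    )


train_idx, valid_idx, test_idx = random_split_indices(10, seed=0)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Passing the same seed twice yields the same three lists, which is the property the seed parameter above is meant to provide.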
- __repr__() str [source]¶
Convert self to repr representation.
- Returns
A string that represents the class.
- Return type
str
Examples
>>> import deepchem as dc
>>> dc.splits.RandomSplitter()
RandomSplitter[]
- __str__() str [source]¶
Convert self to str representation.
- Returns
A string that represents the class.
- Return type
str
Examples
>>> import deepchem as dc
>>> str(dc.splits.RandomSplitter())
'RandomSplitter'
- k_fold_split(dataset: deepchem.data.datasets.Dataset, k: int, directories: Optional[List[str]] = None, **kwargs) List[Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset]] [source]¶
- Parameters
dataset (Dataset) – Dataset to do a k-fold split
k (int) – Number of folds to split dataset into.
directories (List[str], optional (default None)) – List of length 2*k filepaths to save the result disk-datasets.
- Returns
List of length k tuples of (train, cv) where train and cv are both Dataset.
- Return type
List[Tuple[Dataset, Dataset]]
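The shape of the k-fold result, a list of k (train, cv) pairs, can be sketched on plain indices. This is a generic k-fold partition written with the standard library; the helper name k_fold_indices is hypothetical, and the exact procedure DeepChem uses to carve out each fold may differ.

```python
import random
from typing import List, Optional, Tuple


def k_fold_indices(
    n: int, k: int, seed: Optional[int] = None
) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, cv) index pairs; each index is in cv exactly once.

    Hypothetical helper for illustration; not part of DeepChem.
    """
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i,
    # so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    pairs = []
    for i, cv in enumerate(folds):
        # Training indices are all folds except the held-out one.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        pairs.append((train, cv))
    return pairs


folds = k_fold_indices(n=10, k=5, seed=0)
for train, cv in folds:
    print(sorted(cv), len(train))
```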
- train_test_split(dataset: deepchem.data.datasets.Dataset, train_dir: Optional[str] = None, test_dir: Optional[str] = None, frac_train: float = 0.8, seed: Optional[int] = None, **kwargs) Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset] [source]¶
Splits self into train/test sets.
Returns Dataset objects for train/test.
- Parameters
dataset (data like object) – Dataset to be split.
train_dir (str, optional (default None)) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
test_dir (str, optional (default None)) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
seed (int, optional (default None)) – Random seed to use.
- Returns
A tuple of train and test datasets as dc.data.Dataset objects.
- Return type
Tuple[Dataset, Dataset]
- train_valid_test_split(dataset: deepchem.data.datasets.Dataset, train_dir: Optional[str] = None, valid_dir: Optional[str] = None, test_dir: Optional[str] = None, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: Optional[int] = None, log_every_n: int = 1000, **kwargs) Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset] [source]¶
Splits self into train/validation/test sets.
Returns Dataset objects for train, valid, test.
- Parameters
dataset (Dataset) – Dataset to be split.
train_dir (str, optional (default None)) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
valid_dir (str, optional (default None)) – If specified, the directory in which the generated valid dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
test_dir (str, optional (default None)) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
seed (int, optional (default None)) – Random seed to use.
log_every_n (int, optional (default 1000)) – Controls the logger by dictating how often logger outputs will be produced.
- Returns
A tuple of train, valid and test datasets as dc.data.Dataset objects.
- Return type
Tuple[Dataset, Dataset, Dataset]
RandomGroupSplitter¶
- class RandomGroupSplitter(groups: Sequence)[source]¶
Random split based on groupings.
A splitter class that splits on groupings. An example use case is when there are multiple conformations of the same molecule that share the same topology. This splitter guarantees that the resulting splits preserve groupings: all members of a group land in the same split.
Note that it does not use dynamic programming or any other optimization to make the split sizes match frac_train, frac_valid, and frac_test as closely as possible. It simply permutes the groups themselves. As such, use with caution if the number of elements per group varies significantly.
Examples
>>> import deepchem as dc
>>> import numpy as np
>>> X = np.arange(12)
>>> groups = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
>>> splitter = dc.splits.RandomGroupSplitter(groups=groups)
>>> dataset = dc.data.NumpyDataset(X)  # 12 elements
>>> train, test = splitter.train_test_split(dataset, frac_train=0.75, seed=0)
>>> print(train.ids)  # array([6, 7, 8, 9, 10, 11, 3, 4, 5], dtype=object)
[6 7 8 9 10 11 3 4 5]
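The group-preserving behaviour described above can be sketched without DeepChem. The snippet below assumes the fractions are applied to the number of groups rather than the number of elements, which is one plausible reading of "simply permutes the groups themselves"; the helper name group_split_indices is hypothetical.

```python
import random
from typing import Dict, Hashable, List, Optional, Sequence, Tuple


def group_split_indices(
    groups: Sequence[Hashable],
    frac_train: float = 0.8,
    frac_valid: float = 0.1,
    frac_test: float = 0.1,
    seed: Optional[int] = None,
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle whole groups and cut the shuffled group list by fraction.

    Hypothetical helper for illustration; not part of DeepChem.
    """
    if abs(frac_train + frac_valid + frac_test - 1.0) > 1e-8:
        raise ValueError("frac_train + frac_valid + frac_test must equal 1")
    # Map each group label to the dataset indices that carry it.
    members: Dict[Hashable, List[int]] = {}
    for idx, label in enumerate(groups):
        members.setdefault(label, []).append(idx)
    group_ids = list(members)
    random.Random(seed).shuffle(group_ids)
    # Assumption: cutoffs count groups, not elements, so uneven group
    # sizes lead to split sizes that drift from the requested fractions.
    n_groups = len(group_ids)
    train_cutoff = int(frac_train * n_groups)
    valid_cutoff = int((frac_train + frac_valid) * n_groups)

    def flatten(selected: List[Hashable]) -> List[int]:
        return [idx for label in selected for idx in members[label]]

    return (
        flatten(group_ids[:train_cutoff]),
        flatten(group_ids[train_cutoff:valid_cutoff]),
        flatten(group_ids[valid_cutoff:]),
    )


groups = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
train_idx, valid_idx, test_idx = group_split_indices(
    groups, frac_train=0.75, frac_valid=0.0, frac_test=0.25, seed=0
)
print(train_idx, test_idx)
```

With four equally sized groups, three groups go to training and one to test, and no group is ever divided between the two.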
- __init__(groups: Sequence)[source]¶
Initialize this object.
- Parameters
groups (Sequence) – An array indicating the group of each item. Its length is equal to len(dataset.X).
Note
The following are examples of groups.
groups    : 3 2 2 0 1 1 2 4 3
dataset.X : 0 1 2 3 4 5 6 7 8

groups    : a b b e q x a a r
dataset.X : 0 1 2 3 4 5 6 7 8
- split(dataset: deepchem.data.datasets.Dataset, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: Optional[int] = None, log_every_n: Optional[int] = None) Tuple[List[int], List[int], List[int]] [source]¶
Return indices for specified split
- Parameters
dataset (Dataset) – Dataset to be split.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
seed (int, optional (default None)) – Random seed to use.
log_every_n (int, optional (default None)) – Log every n examples (not currently used).
- Returns
A tuple (train_inds, valid_inds, test_inds) of the indices (integers) for the various splits.
- Return type
Tuple[List[int], List[int], List[int]]
- k_fold_split(dataset: deepchem.data.datasets.Dataset, k: int, directories: Optional[List[str]] = None, **kwargs) List[Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset]] [source]¶
- Parameters
dataset (Dataset) – Dataset to do a k-fold split
k (int) – Number of folds to split dataset into.
directories (List[str], optional (default None)) – List of length 2*k filepaths to save the result disk-datasets.
- Returns
List of length k tuples of (train, cv) where train and cv are both Dataset.
- Return type
List[Tuple[Dataset, Dataset]]
- train_test_split(dataset: deepchem.data.datasets.Dataset, train_dir: Optional[str] = None, test_dir: Optional[str] = None, frac_train: float = 0.8, seed: Optional[int] = None, **kwargs) Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset] [source]¶
Splits self into train/test sets.
Returns Dataset objects for train/test.
- Parameters
dataset (data like object) – Dataset to be split.
train_dir (str, optional (default None)) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
test_dir (str, optional (default None)) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
seed (int, optional (default None)) – Random seed to use.
- Returns
A tuple of train and test datasets as dc.data.Dataset objects.
- Return type
Tuple[Dataset, Dataset]
- train_valid_test_split(dataset: deepchem.data.datasets.Dataset, train_dir: Optional[str] = None, valid_dir: Optional[str] = None, test_dir: Optional[str] = None, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: Optional[int] = None, log_every_n: int = 1000, **kwargs) Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset] [source]¶
Splits self into train/validation/test sets.
Returns Dataset objects for train, valid, test.
- Parameters
dataset (Dataset) – Dataset to be split.
train_dir (str, optional (default None)) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
valid_dir (str, optional (default None)) – If specified, the directory in which the generated valid dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
test_dir (str, optional (default None)) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
seed (int, optional (default None)) – Random seed to use.
log_every_n (int, optional (default 1000)) – Controls the logger by dictating how often logger outputs will be produced.
- Returns
A tuple of train, valid and test datasets as dc.data.Dataset objects.
- Return type
Tuple[Dataset, Dataset, Dataset]
RandomStratifiedSplitter¶
- class RandomStratifiedSplitter[source]¶
RandomStratified Splitter class.
For sparse multitask datasets, a standard split offers no guarantees that the splits will have any active compounds. This class tries to arrange that each split has a proportional number of the actives for each task. This is strictly guaranteed only for single-task datasets, but for sparse multitask datasets it usually manages to produce a fairly accurate division of the actives for each task.
Note
This splitter is primarily designed for boolean labeled data. It considers only whether a label is zero or non-zero. When labels can take on multiple non-zero values, it does not try to give each split a proportional fraction of the samples with each value.
Examples
>>> import deepchem as dc
>>> import numpy as np
>>> from typing import Sequence
>>> # creation of demo data set with some smiles strings
>>> smiles = ['C', 'CC', 'CCC', 'CCCC', 'CCCCC']
>>> Xs = np.zeros(len(smiles))
>>> # creation of a deepchem dataset with the smile codes in the ids field
>>> dataset = dc.data.DiskDataset.from_numpy(X=Xs, ids=smiles)
>>> randomstratifiedsplitter = dc.splits.RandomStratifiedSplitter()
>>> train_dataset, test_dataset = randomstratifiedsplitter.train_test_split(dataset)
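For the single-task case, where the proportional division of actives is guaranteed, the idea can be sketched as splitting the actives and the inactives separately and then merging the pieces. This is a simplified standard-library illustration, not the multitask algorithm DeepChem uses; the helper name stratified_split_indices is hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple


def stratified_split_indices(
    labels: Sequence[float],
    frac_train: float = 0.8,
    frac_valid: float = 0.1,
    frac_test: float = 0.1,
    seed: Optional[int] = None,
) -> Tuple[List[int], List[int], List[int]]:
    """Split zero and non-zero labels separately, then merge the pieces.

    Hypothetical single-task helper for illustration; not part of DeepChem.
    """
    if abs(frac_train + frac_valid + frac_test - 1.0) > 1e-8:
        raise ValueError("frac_train + frac_valid + frac_test must equal 1")
    rng = random.Random(seed)
    # Only zero versus non-zero matters, as in the note above.
    actives = [i for i, label in enumerate(labels) if label != 0]
    inactives = [i for i, label in enumerate(labels) if label == 0]
    train: List[int] = []
    valid: List[int] = []
    test: List[int] = []
    for stratum in (actives, inactives):
        rng.shuffle(stratum)
        train_cutoff = int(round(frac_train * len(stratum)))
        valid_cutoff = train_cutoff + int(round(frac_valid * len(stratum)))
        train += stratum[:train_cutoff]
        valid += stratum[train_cutoff:valid_cutoff]
        test += stratum[valid_cutoff:]
    return train, valid, test


labels = [1] * 10 + [0] * 40  # 20% actives
train_idx, valid_idx, test_idx = stratified_split_indices(labels, seed=0)
print(sum(labels[i] for i in train_idx), len(train_idx))
```

Each split keeps the 20% active rate of the full label vector, which a plain random split would only achieve on average.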
- split(dataset: deepchem.data.datasets.Dataset, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: Optional[int] = None, log_every_n: Optional[int] = None) Tuple [source]¶
Return indices for specified split
- Parameters
dataset (dc.data.Dataset) – Dataset to be split.
seed (int, optional (default None)) – Random seed to use.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
log_every_n (int, optional (default None)) – Controls the logger by dictating how often logger outputs will be produced.
- Returns
A tuple (train_inds, valid_inds, test_inds) of the indices (integers) for the various splits.
- Return type
Tuple
- __repr__() str [source]¶
Convert self to repr representation.
- Returns
A string that represents the class.
- Return type
str
Examples
>>> import deepchem as dc
>>> dc.splits.RandomSplitter()
RandomSplitter[]
- __str__() str [source]¶
Convert self to str representation.
- Returns
A string that represents the class.
- Return type
str
Examples
>>> import deepchem as dc
>>> str(dc.splits.RandomSplitter())
'RandomSplitter'
- k_fold_split(dataset: deepchem.data.datasets.Dataset, k: int, directories: Optional[List[str]] = None, **kwargs) List[Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset]] [source]¶
- Parameters
dataset (Dataset) – Dataset to do a k-fold split
k (int) – Number of folds to split dataset into.
directories (List[str], optional (default None)) – List of length 2*k filepaths to save the result disk-datasets.
- Returns
List of length k tuples of (train, cv) where train and cv are both Dataset.
- Return type
List[Tuple[Dataset, Dataset]]
- train_test_split(dataset: deepchem.data.datasets.Dataset, train_dir: Optional[str] = None, test_dir: Optional[str] = None, frac_train: float = 0.8, seed: Optional[int] = None, **kwargs) Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset] [source]¶
Splits self into train/test sets.
Returns Dataset objects for train/test.
- Parameters
dataset (data like object) – Dataset to be split.
train_dir (str, optional (default None)) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
test_dir (str, optional (default None)) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
seed (int, optional (default None)) – Random seed to use.
- Returns
A tuple of train and test datasets as dc.data.Dataset objects.
- Return type
Tuple[Dataset, Dataset]
- train_valid_test_split(dataset: deepchem.data.datasets.Dataset, train_dir: Optional[str] = None, valid_dir: Optional[str] = None, test_dir: Optional[str] = None, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: Optional[int] = None, log_every_n: int = 1000, **kwargs) Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset] [source]¶
Splits self into train/validation/test sets.
Returns Dataset objects for train, valid, test.
- Parameters
dataset (Dataset) – Dataset to be split.
train_dir (str, optional (default None)) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
valid_dir (str, optional (default None)) – If specified, the directory in which the generated valid dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
test_dir (str, optional (default None)) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
seed (int, optional (default None)) – Random seed to use.
log_every_n (int, optional (default 1000)) – Controls the logger by dictating how often logger outputs will be produced.
- Returns
A tuple of train, valid and test datasets as dc.data.Dataset objects.
- Return type
Tuple[Dataset, Dataset, Dataset]
SingletaskStratifiedSplitter¶
- class SingletaskStratifiedSplitter(task_number: int = 0)[source]¶
Class for doing data splits by stratification on a single task.
Examples
>>> import numpy as np
>>> import deepchem as dc
>>> n_samples = 100
>>> n_features = 10
>>> n_tasks = 10
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.random.rand(n_samples, n_tasks)
>>> w = np.ones_like(y)
>>> dataset = dc.data.DiskDataset.from_numpy(X, y, w)
>>> # Stratify on the values of task 5
>>> splitter = dc.splits.SingletaskStratifiedSplitter(task_number=5)
>>> train_dataset, test_dataset = splitter.train_test_split(dataset)
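One common way to stratify on a single continuous task is to sort samples by that task's value, walk through the sorted order in small bins, and distribute each bin across the splits. The sketch below follows that scheme as an assumption about what stratification on a single task can look like; the helper name sorted_bin_split_indices, the bin size of 10, and the choice to put leftover samples into the training split are all illustrative and not taken from the text above.

```python
import random
from typing import List, Optional, Sequence, Tuple


def sorted_bin_split_indices(
    values: Sequence[float],
    frac_train: float = 0.8,
    frac_valid: float = 0.1,
    frac_test: float = 0.1,
    bin_size: int = 10,
    seed: Optional[int] = None,
) -> Tuple[List[int], List[int], List[int]]:
    """Sort by value, then split every bin of bin_size samples by fraction.

    Hypothetical helper for illustration; not part of DeepChem.
    """
    if abs(frac_train + frac_valid + frac_test - 1.0) > 1e-8:
        raise ValueError("frac_train + frac_valid + frac_test must equal 1")
    rng = random.Random(seed)
    order = sorted(range(len(values)), key=lambda i: values[i])
    train_cutoff = int(round(frac_train * bin_size))
    valid_cutoff = train_cutoff + int(round(frac_valid * bin_size))
    train: List[int] = []
    valid: List[int] = []
    test: List[int] = []
    n_full = len(order) - len(order) % bin_size
    for start in range(0, n_full, bin_size):
        chunk = order[start:start + bin_size]
        rng.shuffle(chunk)  # random assignment within a bin of similar values
        train += chunk[:train_cutoff]
        valid += chunk[train_cutoff:valid_cutoff]
        test += chunk[valid_cutoff:]
    train += order[n_full:]  # assumption: an incomplete last bin goes to train
    return train, valid, test


values = [float(i) for i in range(25)]
train_idx, valid_idx, test_idx = sorted_bin_split_indices(values, seed=0)
print(len(train_idx), sorted(valid_idx), sorted(test_idx))
```

Because every bin contributes to every split, each split covers the low and high ends of the value range instead of clustering in one region.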
- __init__(task_number: int = 0)[source]¶
Creates splitter object.
- Parameters
task_number (int, optional (default 0)) – Task number for stratification.
- k_fold_split(dataset: deepchem.data.datasets.Dataset, k: int, directories: Optional[List[str]] = None, seed: Optional[int] = None, log_every_n: Optional[int] = None, **kwargs) List[deepchem.data.datasets.Dataset] [source]¶
Splits compounds into k-folds using stratified sampling. Overriding base class k_fold_split.
- Parameters
dataset (Dataset) – Dataset to be split.
k (int) – Number of folds to split dataset into.
directories (List[str], optional (default None)) – List of length k filepaths to save the result disk-datasets.
seed (int, optional (default None)) – Random seed to use.
log_every_n (int, optional (default None)) – Log every n examples (not currently used).
- Returns
fold_datasets – List of dc.data.Dataset objects
- Return type
List[Dataset]
- split(dataset: deepchem.data.datasets.Dataset, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: Optional[int] = None, log_every_n: Optional[int] = None) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Splits compounds into train/validation/test using stratified sampling.
- Parameters
dataset (Dataset) – Dataset to be split.
frac_train (float, optional (default 0.8)) – Fraction of dataset put into training data.
frac_valid (float, optional (default 0.1)) – Fraction of dataset put into validation data.
frac_test (float, optional (default 0.1)) – Fraction of dataset put into test data.
seed (int, optional (default None)) – Random seed to use.
log_every_n (int, optional (default None)) – Log every n examples (not currently used).
- Returns
A tuple of train indices, valid indices, and test indices. Each element is a NumPy array of indices.
- Return type
Tuple[np.ndarray, np.ndarray, np.ndarray]
- train_test_split(dataset: deepchem.data.datasets.Dataset, train_dir: Optional[str] = None, test_dir: Optional[str] = None, frac_train: float = 0.8, seed: Optional[int] = None, **kwargs) Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset] [source]¶
Splits self into train/test sets.
Returns Dataset objects for train/test.
- Parameters
dataset (data like object) – Dataset to be split.
train_dir (str, optional (default None)) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
test_dir (str, optional (default None)) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
seed (int, optional (default None)) – Random seed to use.
- Returns
A tuple of train and test datasets as dc.data.Dataset objects.
- Return type
Tuple[Dataset, Dataset]
- train_valid_test_split(dataset: deepchem.data.datasets.Dataset, train_dir: Optional[str] = None, valid_dir: Optional[str] = None, test_dir: Optional[str] = None, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: Optional[int] = None, log_every_n: int = 1000, **kwargs) Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset] [source]¶
Splits self into train/validation/test sets.
Returns Dataset objects for train, valid, test.
- Parameters
dataset (Dataset) – Dataset to be split.
train_dir (str, optional (default None)) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
valid_dir (str, optional (default None)) – If specified, the directory in which the generated valid dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
test_dir (str, optional (default None)) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
seed (int, optional (default None)) – Random seed to use.
log_every_n (int, optional (default 1000)) – Controls the logger by dictating how often logger outputs will be produced.
- Returns
A tuple of train, valid and test datasets as dc.data.Dataset objects.
- Return type
Tuple[Dataset, Dataset, Dataset]
IndexSplitter¶
- class IndexSplitter[source]¶
Class for simple order based splits.
Use this class when the Dataset you have is already ordered as you would like it to be processed. The first frac_train proportion is used for training, the next frac_valid for validation, and the final frac_test for testing. This class may make sense to use if your Dataset is already time-ordered, for example.
Examples
>>> import deepchem as dc
>>> import numpy as np
>>> n_samples = 5
>>> n_features = 2
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.random.rand(n_samples)
>>> indexsplitter = dc.splits.IndexSplitter()
>>> dataset = dc.data.NumpyDataset(X, y)
>>> train_dataset, test_dataset = indexsplitter.train_test_split(dataset)
>>> print(train_dataset.ids)
[0 1 2 3]
>>> print(test_dataset.ids)
[4]
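The order-based rule described above reduces to two cutoffs on range(n). The following standard-library sketch shows that arithmetic; the helper name index_split is hypothetical and the truncating int() conversion is an assumption about how fractional cutoffs are resolved.

```python
from typing import List, Tuple


def index_split(
    n: int,
    frac_train: float = 0.8,
    frac_valid: float = 0.1,
    frac_test: float = 0.1,
) -> Tuple[List[int], List[int], List[int]]:
    """Cut range(n) into consecutive train/valid/test index blocks.

    Hypothetical helper for illustration; not part of DeepChem.
    """
    if abs(frac_train + frac_valid + frac_test - 1.0) > 1e-8:
        raise ValueError("frac_train + frac_valid + frac_test must equal 1")
    # No shuffling: the existing order of the dataset decides membership.
    train_cutoff = int(frac_train * n)
    valid_cutoff = int((frac_train + frac_valid) * n)
    indices = list(range(n))
    return (
        indices[:train_cutoff],
        indices[train_cutoff:valid_cutoff],
        indices[valid_cutoff:],
    )


train_idx, valid_idx, test_idx = index_split(10)
print(train_idx, valid_idx, test_idx)
```

For a time-ordered dataset this means the model is always validated and tested on samples that come later than everything it was trained on.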
- split(dataset: deepchem.data.datasets.Dataset, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: Optional[int] = None, log_every_n: Optional[int] = None) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Splits internal compounds into train/validation/test in provided order.
- Parameters
dataset (Dataset) – Dataset to be split.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
seed (int, optional (default None)) – Random seed to use.
log_every_n (int, optional) – Log every n examples (not currently used).
- Returns
A tuple of train indices, valid indices, and test indices. Each element is a NumPy array of indices.
- Return type
Tuple[np.ndarray, np.ndarray, np.ndarray]
- __repr__() str [source]¶
Convert self to repr representation.
- Returns
A string that represents the class.
- Return type
str
Examples
>>> import deepchem as dc
>>> dc.splits.RandomSplitter()
RandomSplitter[]
- __str__() str [source]¶
Convert self to str representation.
- Returns
A string that represents the class.
- Return type
str
Examples
>>> import deepchem as dc
>>> str(dc.splits.RandomSplitter())
'RandomSplitter'
- k_fold_split(dataset: deepchem.data.datasets.Dataset, k: int, directories: Optional[List[str]] = None, **kwargs) List[Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset]] [source]¶
- Parameters
dataset (Dataset) – Dataset to do a k-fold split
k (int) – Number of folds to split dataset into.
directories (List[str], optional (default None)) – List of length 2*k filepaths to save the result disk-datasets.
- Returns
List of length k tuples of (train, cv) where train and cv are both Dataset.
- Return type
List[Tuple[Dataset, Dataset]]
- train_test_split(dataset: deepchem.data.datasets.Dataset, train_dir: Optional[str] = None, test_dir: Optional[str] = None, frac_train: float = 0.8, seed: Optional[int] = None, **kwargs) Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset] [source]¶
Splits self into train/test sets.
Returns Dataset objects for train/test.
- Parameters
dataset (data like object) – Dataset to be split.
train_dir (str, optional (default None)) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
test_dir (str, optional (default None)) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
seed (int, optional (default None)) – Random seed to use.
- Returns
A tuple of train and test datasets as dc.data.Dataset objects.
- Return type
Tuple[Dataset, Dataset]
- train_valid_test_split(dataset: deepchem.data.datasets.Dataset, train_dir: Optional[str] = None, valid_dir: Optional[str] = None, test_dir: Optional[str] = None, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: Optional[int] = None, log_every_n: int = 1000, **kwargs) Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset] [source]¶
Splits self into train/validation/test sets.
Returns Dataset objects for train, valid, test.
- Parameters
dataset (Dataset) – Dataset to be split.
train_dir (str, optional (default None)) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
valid_dir (str, optional (default None)) – If specified, the directory in which the generated valid dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
test_dir (str, optional (default None)) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
seed (int, optional (default None)) – Random seed to use.
log_every_n (int, optional (default 1000)) – Controls the logger by dictating how often logger outputs will be produced.
- Returns
A tuple of train, valid and test datasets as dc.data.Dataset objects.
- Return type
Tuple[Dataset, Dataset, Dataset]
SpecifiedSplitter¶
- class SpecifiedSplitter(valid_indices: Optional[List[int]] = None, test_indices: Optional[List[int]] = None)[source]¶
Split data in the fashion specified by user.
For some applications, you will already know how you’d like to split the dataset. In this splitter, you simply specify valid_indices and test_indices, and the datapoints at those indices are pulled out of the dataset. Note that this is different from IndexSplitter, which only splits based on the existing dataset ordering, while SpecifiedSplitter can split on any specified ordering.
Examples
>>> import deepchem as dc
>>> import numpy as np
>>> n_samples = 10
>>> n_features = 3
>>> n_tasks = 1
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.random.rand(n_samples, n_tasks)
>>> splitter = dc.splits.SpecifiedSplitter(valid_indices=[1,3,5], test_indices=[0,2,7,9])
>>> dataset = dc.data.NumpyDataset(X, y)
>>> train_dataset, valid_dataset, test_dataset = splitter.train_valid_test_split(dataset)
>>> print(train_dataset.ids)
[4 6 8]
>>> print(valid_dataset.ids)
[1 3 5]
>>> print(test_dataset.ids)
[0 2 7 9]
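As the example output suggests, the training split is whatever remains once the validation and test indices are pulled out. A minimal standard-library sketch of that complement logic follows; the helper name specified_split is hypothetical and this is not DeepChem's code.

```python
from typing import List, Optional, Sequence, Tuple


def specified_split(
    n: int,
    valid_indices: Optional[Sequence[int]] = None,
    test_indices: Optional[Sequence[int]] = None,
) -> Tuple[List[int], List[int], List[int]]:
    """Return (train, valid, test), where train is every index not held out.

    Hypothetical helper for illustration; not part of DeepChem.
    """
    valid = list(valid_indices or [])
    test = list(test_indices or [])
    held_out = set(valid) | set(test)
    # Training indices keep the dataset's original order.
    train = [i for i in range(n) if i not in held_out]
    return train, valid, test


train_idx, valid_idx, test_idx = specified_split(
    10, valid_indices=[1, 3, 5], test_indices=[0, 2, 7, 9]
)
print(train_idx, valid_idx, test_idx)
```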
- __init__(valid_indices: Optional[List[int]] = None, test_indices: Optional[List[int]] = None)[source]¶
- Parameters
valid_indices (List[int]) – List of indices of samples in the valid set
test_indices (List[int]) – List of indices of samples in the test set
- split(dataset: deepchem.data.datasets.Dataset, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: Optional[int] = None, log_every_n: Optional[int] = None) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Splits internal compounds into train/validation/test in designated order.
- Parameters
dataset (Dataset) – Dataset to be split.
frac_train (float, optional (default 0.8)) – Fraction of dataset put into training data.
frac_valid (float, optional (default 0.1)) – Fraction of dataset put into validation data.
frac_test (float, optional (default 0.1)) – Fraction of dataset put into test data.
seed (int, optional (default None)) – Random seed to use.
log_every_n (int, optional (default None)) – Log every n examples (not currently used).
- Returns
A tuple of train indices, valid indices, and test indices. Each element is a NumPy array of indices.
- Return type
Tuple[np.ndarray, np.ndarray, np.ndarray]
- k_fold_split(dataset: deepchem.data.datasets.Dataset, k: int, directories: Optional[List[str]] = None, **kwargs) List[Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset]] [source]¶
- Parameters
dataset (Dataset) – Dataset to do a k-fold split
k (int) – Number of folds to split dataset into.
directories (List[str], optional (default None)) – List of length 2*k filepaths to save the result disk-datasets.
- Returns
List of length k tuples of (train, cv) where train and cv are both Dataset.
- Return type
List[Tuple[Dataset, Dataset]]
- train_test_split(dataset: deepchem.data.datasets.Dataset, train_dir: Optional[str] = None, test_dir: Optional[str] = None, frac_train: float = 0.8, seed: Optional[int] = None, **kwargs) Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset] [source]¶
Splits self into train/test sets.
Returns Dataset objects for train/test.
- Parameters
dataset (data like object) – Dataset to be split.
train_dir (str, optional (default None)) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
test_dir (str, optional (default None)) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
seed (int, optional (default None)) – Random seed to use.
- Returns
A tuple of train and test datasets as dc.data.Dataset objects.
- Return type
- train_valid_test_split(dataset: deepchem.data.datasets.Dataset, train_dir: Optional[str] = None, valid_dir: Optional[str] = None, test_dir: Optional[str] = None, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: Optional[int] = None, log_every_n: int = 1000, **kwargs) Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset] [source]¶
Splits self into train/validation/test sets.
Returns Dataset objects for train, valid, test.
- Parameters
dataset (Dataset) – Dataset to be split.
train_dir (str, optional (default None)) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
valid_dir (str, optional (default None)) – If specified, the directory in which the generated valid dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
test_dir (str, optional (default None)) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
seed (int, optional (default None)) – Random seed to use.
log_every_n (int, optional (default 1000)) – Controls the logger by dictating how often logger outputs will be produced.
- Returns
A tuple of train, valid and test datasets as dc.data.Dataset objects.
- Return type
TaskSplitter¶
- class TaskSplitter[source]¶
Provides a simple interface for splitting datasets task-wise.
For some learning problems, the training and test datasets should have different tasks entirely. This is a different paradigm from the usual Splitter, which ensures that split datasets have different datapoints, not different tasks.
- train_valid_test_split(dataset, frac_train=0.8, frac_valid=0.1, frac_test=0.1)[source]¶
Performs a train/valid/test split of the tasks for dataset.
If split is uneven, spillover goes to test.
- Parameters
dataset (dc.data.Dataset) – Dataset to be split
frac_train (float, optional) – Proportion of tasks to be put into train. Rounded to nearest int.
frac_valid (float, optional) – Proportion of tasks to be put into valid. Rounded to nearest int.
frac_test (float, optional) – Proportion of tasks to be put into test. Rounded to nearest int.
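The rounding rule described above can be sketched in plain Python. This is an illustration rather than DeepChem's implementation: `task_counts` is a hypothetical helper, and it assumes that train and valid are rounded to the nearest integer while test receives whatever tasks remain.

```python
def task_counts(n_tasks, frac_train=0.8, frac_valid=0.1):
    """Return (n_train, n_valid, n_test) task counts for a task-wise split."""
    # Hypothetical helper: round train and valid to the nearest int ...
    n_train = int(round(frac_train * n_tasks))
    n_valid = int(round(frac_valid * n_tasks))
    # ... and let any spillover from an uneven split go to test (assumption).
    n_test = n_tasks - n_train - n_valid
    return n_train, n_valid, n_test


print(task_counts(10))  # (8, 1, 1)
print(task_counts(14))  # (11, 1, 2)
```

With 14 tasks the fractions do not divide evenly, so the two leftover tasks end up in the test split.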
- k_fold_split(dataset, K)[source]¶
Performs a K-fold split of the tasks for dataset.
If split is uneven, spillover goes to last fold.
- Parameters
dataset (dc.data.Dataset) – Dataset to be split
K (int) – Number of splits to be made
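A minimal sketch of how tasks might be distributed over K folds, with the spillover going to the last fold. The helper `task_folds` is hypothetical, and the use of floor division for the per-fold size is an assumption made for illustration; DeepChem's own rounding may differ.

```python
def task_folds(n_tasks, K):
    """Split task indices 0..n_tasks-1 into K contiguous folds."""
    # Assumption for illustration: every fold gets n_tasks // K tasks ...
    per_fold = n_tasks // K
    folds = []
    for fold in range(K):
        start = fold * per_fold
        # ... except the last fold, which also absorbs the spillover.
        end = n_tasks if fold == K - 1 else start + per_fold
        folds.append(list(range(start, end)))
    return folds


print(task_folds(7, 3))  # [[0, 1], [2, 3], [4, 5, 6]]
```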
- split(dataset: deepchem.data.datasets.Dataset, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: Optional[int] = None, log_every_n: Optional[int] = None) Tuple [source]¶
Return indices for specified split
- Parameters
dataset (dc.data.Dataset) – Dataset to be split.
seed (int, optional (default None)) – Random seed to use.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
log_every_n (int, optional (default None)) – Controls the logger by dictating how often logger outputs will be produced.
- Returns
A tuple (train_inds, valid_inds, test_inds) of the indices (integers) for the various splits.
- Return type
Tuple
- train_test_split(dataset: deepchem.data.datasets.Dataset, train_dir: Optional[str] = None, test_dir: Optional[str] = None, frac_train: float = 0.8, seed: Optional[int] = None, **kwargs) Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset] [source]¶
Splits self into train/test sets.
Returns Dataset objects for train/test.
- Parameters
dataset (data like object) – Dataset to be split.
train_dir (str, optional (default None)) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
test_dir (str, optional (default None)) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
seed (int, optional (default None)) – Random seed to use.
- Returns
A tuple of train and test datasets as dc.data.Dataset objects.
- Return type
Molecule Splitters¶
ScaffoldSplitter¶
- class ScaffoldSplitter[source]¶
Class for doing data splits based on the scaffold of small molecules.
Group molecules based on the Bemis-Murcko scaffold representation, which identifies rings, linkers, frameworks (combinations of linkers and rings) and atomic properties such as atom type, hybridization and bond order in a dataset of molecules. Then split the groups by the number of molecules in each group in decreasing order.
The SMILES strings must be supplied in the ids field when the DiskDataset is created.
Examples
>>> import numpy as np
>>> import deepchem as dc
>>> # creation of demo data set with some SMILES strings
>>> data_test = ["CC(C)Cl", "CCC(C)CO", "CCCCCCCO", "CCCCCCCC(=O)OC", "c3ccc2nc1ccccc1cc2c3", "Nc2cccc3nc1ccccc1cc23", "C1CCCCCC1"]
>>> Xs = np.zeros(len(data_test))
>>> Ys = np.ones(len(data_test))
>>> # creation of a deepchem dataset with the SMILES strings in the ids field
>>> dataset = dc.data.DiskDataset.from_numpy(X=Xs, y=Ys, w=np.zeros(len(data_test)), ids=data_test)
>>> scaffoldsplitter = dc.splits.ScaffoldSplitter()
>>> train, test = scaffoldsplitter.train_test_split(dataset)
>>> train
<DiskDataset X.shape: (5,), y.shape: (5,), w.shape: (5,), ids: ['CC(C)Cl' 'CCC(C)CO' 'CCCCCCCO' 'CCCCCCCC(=O)OC' 'C1CCCCCC1'], task_names: [0]>
References
- 1
Bemis, Guy W., and Mark A. Murcko. “The properties of known drugs. 1. Molecular frameworks.” Journal of medicinal chemistry 39.15 (1996): 2887-2893.
Note
This class requires RDKit to be installed.
- split(dataset: deepchem.data.datasets.Dataset, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: Optional[int] = None, log_every_n: Optional[int] = 1000) Tuple[List[int], List[int], List[int]] [source]¶
Splits internal compounds into train/validation/test by scaffold.
- Parameters
dataset (Dataset) – Dataset to be split.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
seed (int, optional (default None)) – Random seed to use.
log_every_n (int, optional (default 1000)) – Controls the logger by dictating how often logger outputs will be produced.
- Returns
A tuple of train indices, valid indices, and test indices. Each is a list of integer indices.
- Return type
Tuple[List[int], List[int], List[int]]
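The idea of keeping whole scaffold groups together can be sketched as a greedy assignment. This is a simplified illustration under stated assumptions, not DeepChem's code: `assign_groups` is a hypothetical helper, the groups are given as precomputed lists of indices (no RDKit involved), and a group goes to the first split that still has room for it.

```python
def assign_groups(groups, n_total, frac_train=0.8, frac_valid=0.1):
    """Greedily place whole groups of indices into train/valid/test."""
    train_cutoff = frac_train * n_total
    valid_cutoff = (frac_train + frac_valid) * n_total
    train, valid, test = [], [], []
    # Visit the largest groups first so they land in the training split.
    for group in sorted(groups, key=len, reverse=True):
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Hypothetical scaffold groups over 10 molecules (indices into the dataset).
groups = [[0, 1, 2, 3, 4], [5, 6, 7], [8], [9]]
print(assign_groups(groups, n_total=10))
# ([0, 1, 2, 3, 4, 5, 6, 7], [8], [9])
```

Because groups are never broken up, the realised split sizes only approximate the requested fractions when groups are large.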
- generate_scaffolds(dataset: deepchem.data.datasets.Dataset, log_every_n: int = 1000) List[List[int]] [source]¶
Returns all scaffolds from the dataset.
- Parameters
dataset (Dataset) – Dataset to be split.
log_every_n (int, optional (default 1000)) – Controls the logger by dictating how often logger outputs will be produced.
- Returns
scaffold_sets – List of indices of each scaffold in the dataset.
- Return type
List[List[int]]
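The returned structure, a list of index lists with one list per scaffold, can be illustrated without RDKit by grouping on precomputed keys. The helper `group_by_key` and the scaffold strings below are hypothetical, and the tie-breaking rule is an assumption chosen for determinism.

```python
from collections import defaultdict


def group_by_key(keys):
    """Group dataset indices by a precomputed key, largest group first."""
    buckets = defaultdict(list)
    for index, key in enumerate(keys):
        buckets[key].append(index)
    # Sort by group size (descending); ties broken by the first index.
    return sorted(buckets.values(), key=lambda g: (-len(g), g[0]))


# Hypothetical scaffold strings standing in for RDKit-derived scaffolds;
# the empty string represents molecules without a ring system.
keys = ["c1ccccc1", "", "c1ccccc1", "C1CCCCC1", "", "c1ccccc1"]
print(group_by_key(keys))  # [[0, 2, 5], [1, 4], [3]]
```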
- __repr__() str [source]¶
Convert self to repr representation.
- Returns
A string that represents the class.
- Return type
str
Examples
>>> import deepchem as dc
>>> dc.splits.RandomSplitter()
RandomSplitter[]
- __str__() str [source]¶
Convert self to str representation.
- Returns
A string that represents the class.
- Return type
str
Examples
>>> import deepchem as dc
>>> str(dc.splits.RandomSplitter())
'RandomSplitter'
- k_fold_split(dataset: deepchem.data.datasets.Dataset, k: int, directories: Optional[List[str]] = None, **kwargs) List[Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset]] [source]¶
- Parameters
dataset (Dataset) – Dataset to split into k folds.
k (int) – Number of folds to split dataset into.
directories (List[str], optional (default None)) – List of 2*k file paths in which to save the resulting disk datasets.
- Returns
List of k tuples (train, cv), where train and cv are both Dataset objects.
- Return type
- train_test_split(dataset: deepchem.data.datasets.Dataset, train_dir: Optional[str] = None, test_dir: Optional[str] = None, frac_train: float = 0.8, seed: Optional[int] = None, **kwargs) Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset] [source]¶
Splits self into train/test sets.
Returns Dataset objects for train/test.
- Parameters
dataset (data like object) – Dataset to be split.
train_dir (str, optional (default None)) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
test_dir (str, optional (default None)) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
seed (int, optional (default None)) – Random seed to use.
- Returns
A tuple of train and test datasets as dc.data.Dataset objects.
- Return type
- train_valid_test_split(dataset: deepchem.data.datasets.Dataset, train_dir: Optional[str] = None, valid_dir: Optional[str] = None, test_dir: Optional[str] = None, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: Optional[int] = None, log_every_n: int = 1000, **kwargs) Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset] [source]¶
Splits self into train/validation/test sets.
Returns Dataset objects for train, valid, test.
- Parameters
dataset (Dataset) – Dataset to be split.
train_dir (str, optional (default None)) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
valid_dir (str, optional (default None)) – If specified, the directory in which the generated valid dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
test_dir (str, optional (default None)) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
seed (int, optional (default None)) – Random seed to use.
log_every_n (int, optional (default 1000)) – Controls the logger by dictating how often logger outputs will be produced.
- Returns
A tuple of train, valid and test datasets as dc.data.Dataset objects.
- Return type
MolecularWeightSplitter¶
- class MolecularWeightSplitter[source]¶
Class for doing data splits by molecular weight.
Note
This class requires RDKit to be installed.
Examples
>>> import deepchem as dc
>>> import numpy as np
>>> # creation of demo data set with some SMILES strings
>>> smiles = ['C', 'CC', 'CCC', 'CCCC', 'CCCCC']
>>> Xs = np.zeros(len(smiles))
>>> # creation of a deepchem dataset with the SMILES strings in the ids field
>>> dataset = dc.data.DiskDataset.from_numpy(X=Xs, ids=smiles)
>>> molecularweightsplitter = dc.splits.MolecularWeightSplitter()
>>> train_dataset, test_dataset = molecularweightsplitter.train_test_split(dataset)
>>> print(train_dataset.ids)
['C' 'CC' 'CCC' 'CCCC']
>>> print(test_dataset.ids)
['CCCCC']
- split(dataset: deepchem.data.datasets.Dataset, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: Optional[int] = None, log_every_n: Optional[int] = None) Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray] [source]¶
Splits on molecular weight.
Splits internal compounds into train/validation/test using the MW calculated by SMILES string.
- Parameters
dataset (Dataset) – Dataset to be split.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
seed (int, optional (default None)) – Random seed to use.
log_every_n (int, optional (default None)) – Log every n examples (not currently used).
- Returns
A tuple of train indices, valid indices, and test indices. Each is a NumPy array of indices.
- Return type
Tuple[np.ndarray, np.ndarray, np.ndarray]
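A sort-then-cut split like this one can be sketched with the standard library alone. The helper `split_by_sorted_value` is hypothetical, the weights are supplied directly instead of being computed from SMILES with RDKit, and truncating the cutoffs with `int` is an assumption made for illustration.

```python
def split_by_sorted_value(values, frac_train=0.8, frac_valid=0.1):
    """Return (train, valid, test) index lists after sorting by value."""
    # Indices ordered from the lightest to the heaviest molecule.
    order = sorted(range(len(values)), key=lambda i: values[i])
    train_cutoff = int(frac_train * len(values))
    valid_cutoff = int((frac_train + frac_valid) * len(values))
    return (order[:train_cutoff],
            order[train_cutoff:valid_cutoff],
            order[valid_cutoff:])


# Approximate molecular weights (g/mol) of propane, methane, pentane,
# ethane and butane, listed out of order to show the effect of sorting.
weights = [44.10, 16.04, 72.15, 30.07, 58.12]
print(split_by_sorted_value(weights))  # ([1, 3, 0, 4], [], [2])
```

With only five molecules the validation split is empty and the heaviest molecule lands in the test split, matching the pattern in the class example.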
- __repr__() str [source]¶
Convert self to repr representation.
- Returns
A string that represents the class.
- Return type
str
Examples
>>> import deepchem as dc
>>> dc.splits.RandomSplitter()
RandomSplitter[]
- __str__() str [source]¶
Convert self to str representation.
- Returns
A string that represents the class.
- Return type
str
Examples
>>> import deepchem as dc
>>> str(dc.splits.RandomSplitter())
'RandomSplitter'
- k_fold_split(dataset: deepchem.data.datasets.Dataset, k: int, directories: Optional[List[str]] = None, **kwargs) List[Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset]] [source]¶
- Parameters
dataset (Dataset) – Dataset to split into k folds.
k (int) – Number of folds to split dataset into.
directories (List[str], optional (default None)) – List of 2*k file paths in which to save the resulting disk datasets.
- Returns
List of k tuples (train, cv), where train and cv are both Dataset objects.
- Return type
- train_test_split(dataset: deepchem.data.datasets.Dataset, train_dir: Optional[str] = None, test_dir: Optional[str] = None, frac_train: float = 0.8, seed: Optional[int] = None, **kwargs) Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset] [source]¶
Splits self into train/test sets.
Returns Dataset objects for train/test.
- Parameters
dataset (data like object) – Dataset to be split.
train_dir (str, optional (default None)) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
test_dir (str, optional (default None)) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
seed (int, optional (default None)) – Random seed to use.
- Returns
A tuple of train and test datasets as dc.data.Dataset objects.
- Return type
- train_valid_test_split(dataset: deepchem.data.datasets.Dataset, train_dir: Optional[str] = None, valid_dir: Optional[str] = None, test_dir: Optional[str] = None, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: Optional[int] = None, log_every_n: int = 1000, **kwargs) Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset] [source]¶
Splits self into train/validation/test sets.
Returns Dataset objects for train, valid, test.
- Parameters
dataset (Dataset) – Dataset to be split.
train_dir (str, optional (default None)) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
valid_dir (str, optional (default None)) – If specified, the directory in which the generated valid dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
test_dir (str, optional (default None)) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
seed (int, optional (default None)) – Random seed to use.
log_every_n (int, optional (default 1000)) – Controls the logger by dictating how often logger outputs will be produced.
- Returns
A tuple of train, valid and test datasets as dc.data.Dataset objects.
- Return type
MaxMinSplitter¶
- class MaxMinSplitter[source]¶
Chemical diversity splitter.
Class for doing splits based on the MaxMin diversity algorithm. Intuitively, the test set consists of the most diverse compounds in the entire dataset, and the validation set consists of the next most diverse compounds after those chosen for the test set.
Note
This class requires RDKit to be installed.
Examples
>>> import deepchem as dc
>>> import numpy as np
>>> # creation of demo data set with some SMILES strings
>>> smiles = ['C', 'CC', 'CCC', 'CCCC', 'CCCCC']
>>> Xs = np.zeros(len(smiles))
>>> # creation of a deepchem dataset with the SMILES strings in the ids field
>>> dataset = dc.data.DiskDataset.from_numpy(X=Xs, ids=smiles)
>>> maxminsplitter = dc.splits.MaxMinSplitter()
>>> train_dataset, test_dataset = maxminsplitter.train_test_split(dataset)
- split(dataset: deepchem.data.datasets.Dataset, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: Optional[int] = None, log_every_n: Optional[int] = None) Tuple[List[int], List[int], List[int]] [source]¶
Splits internal compounds into train/validation/test using the MaxMin diversity algorithm.
- Parameters
dataset (Dataset) – Dataset to be split.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
seed (int, optional (default None)) – Random seed to use.
log_every_n (int, optional (default None)) – Log every n examples (not currently used).
- Returns
A tuple of train indices, valid indices, and test indices. Each is a list of integer indices.
- Return type
Tuple[List[int], List[int], List[int]]
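The MaxMin idea, repeatedly picking the item that is farthest from everything picked so far, can be sketched generically. This is not DeepChem's or RDKit's implementation: `maxmin_pick` is a hypothetical helper that works on a plain distance matrix, and the starting item is an assumption.

```python
def maxmin_pick(dist, n_pick, first=0):
    """Greedy MaxMin selection over a symmetric distance matrix."""
    n_pick = min(n_pick, len(dist))
    picked = [first]
    while len(picked) < n_pick:
        candidates = [i for i in range(len(dist)) if i not in picked]
        # Choose the candidate whose nearest picked neighbour is farthest away.
        best = max(candidates, key=lambda i: min(dist[i][j] for j in picked))
        picked.append(best)
    return picked


# Toy data: points on a line, with absolute difference as the distance.
points = [0.0, 1.0, 2.0, 10.0]
dist = [[abs(a - b) for b in points] for a in points]
print(maxmin_pick(dist, 3))  # [0, 3, 2]
```

Starting from the first point, the outlier at 10.0 is chosen next, followed by the point that is farthest from both picks.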
- __repr__() str [source]¶
Convert self to repr representation.
- Returns
A string that represents the class.
- Return type
str
Examples
>>> import deepchem as dc
>>> dc.splits.RandomSplitter()
RandomSplitter[]
- __str__() str [source]¶
Convert self to str representation.
- Returns
A string that represents the class.
- Return type
str
Examples
>>> import deepchem as dc
>>> str(dc.splits.RandomSplitter())
'RandomSplitter'
- k_fold_split(dataset: deepchem.data.datasets.Dataset, k: int, directories: Optional[List[str]] = None, **kwargs) List[Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset]] [source]¶
- Parameters
dataset (Dataset) – Dataset to split into k folds.
k (int) – Number of folds to split dataset into.
directories (List[str], optional (default None)) – List of 2*k file paths in which to save the resulting disk datasets.
- Returns
List of k tuples (train, cv), where train and cv are both Dataset objects.
- Return type
- train_test_split(dataset: deepchem.data.datasets.Dataset, train_dir: Optional[str] = None, test_dir: Optional[str] = None, frac_train: float = 0.8, seed: Optional[int] = None, **kwargs) Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset] [source]¶
Splits self into train/test sets.
Returns Dataset objects for train/test.
- Parameters
dataset (data like object) – Dataset to be split.
train_dir (str, optional (default None)) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
test_dir (str, optional (default None)) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
seed (int, optional (default None)) – Random seed to use.
- Returns
A tuple of train and test datasets as dc.data.Dataset objects.
- Return type
- train_valid_test_split(dataset: deepchem.data.datasets.Dataset, train_dir: Optional[str] = None, valid_dir: Optional[str] = None, test_dir: Optional[str] = None, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: Optional[int] = None, log_every_n: int = 1000, **kwargs) Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset] [source]¶
Splits self into train/validation/test sets.
Returns Dataset objects for train, valid, test.
- Parameters
dataset (Dataset) – Dataset to be split.
train_dir (str, optional (default None)) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
valid_dir (str, optional (default None)) – If specified, the directory in which the generated valid dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
test_dir (str, optional (default None)) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
seed (int, optional (default None)) – Random seed to use.
log_every_n (int, optional (default 1000)) – Controls the logger by dictating how often logger outputs will be produced.
- Returns
A tuple of train, valid and test datasets as dc.data.Dataset objects.
- Return type
ButinaSplitter¶
- class ButinaSplitter(cutoff: float = 0.6)[source]¶
Class for doing data splits based on the Butina clustering of a bulk Tanimoto fingerprint matrix.
Note
This class requires RDKit to be installed.
Examples
>>> import deepchem as dc
>>> import numpy as np
>>> # creation of demo data set with some SMILES strings
>>> smiles = ['C', 'CC', 'CCC', 'CCCC', 'CCCCC']
>>> Xs = np.zeros(len(smiles))
>>> # creation of a deepchem dataset with the SMILES strings in the ids field
>>> dataset = dc.data.DiskDataset.from_numpy(X=Xs, ids=smiles)
>>> butinasplitter = dc.splits.ButinaSplitter()
>>> train_dataset, test_dataset = butinasplitter.train_test_split(dataset)
>>> print(train_dataset.ids)
['CCCC' 'CCC' 'CCCCC' 'CC']
>>> print(test_dataset.ids)
['C']
- __init__(cutoff: float = 0.6)[source]¶
Create a ButinaSplitter.
- Parameters
cutoff (float (default 0.6)) – The cutoff value for Tanimoto similarity. Molecules that are more similar than this will tend to be put in the same dataset.
- split(dataset: deepchem.data.datasets.Dataset, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: Optional[int] = None, log_every_n: Optional[int] = None) Tuple[List[int], List[int], List[int]] [source]¶
Splits internal compounds into train and validation based on the Butina clustering algorithm. This splitting algorithm has an O(N^2) run time, where N is the number of elements in the dataset. The dataset is expected to be a classification dataset.
This algorithm is designed to generate validation data that are novel chemotypes. Setting a small cutoff value will generate smaller, finer clusters of high similarity, whereas setting a large cutoff value will generate larger, coarser clusters of low similarity.
- Parameters
dataset (Dataset) – Dataset to be split.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
seed (int, optional (default None)) – Random seed to use.
log_every_n (int, optional (default None)) – Log every n examples (not currently used).
- Returns
A tuple of train indices, valid indices, and test indices.
- Return type
Tuple[List[int], List[int], List[int]]
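The effect of the cutoff on cluster granularity can be seen in a small Butina-style clustering sketch. This is a simplified illustration, not the RDKit routine: `butina_clusters` is a hypothetical helper, the distance matrix is made up, and the cutoff is assumed to act as a distance threshold, in line with the description above that small values give finer clusters.

```python
def butina_clusters(dist, cutoff):
    """Cluster items from a symmetric distance matrix, Butina-style."""
    n = len(dist)
    # Neighbours of i: every other item within `cutoff` distance of i.
    neighbours = [
        [j for j in range(n) if j != i and dist[i][j] <= cutoff]
        for i in range(n)
    ]
    # Visit items with the most neighbours first (ties: lowest index).
    order = sorted(range(n), key=lambda i: (-len(neighbours[i]), i))
    assigned, clusters = set(), []
    for centre in order:
        if centre in assigned:
            continue
        # A cluster is the centre plus its not-yet-assigned neighbours.
        members = [centre] + [j for j in neighbours[centre] if j not in assigned]
        assigned.update(members)
        clusters.append(members)
    return clusters


# Made-up distances: items 0-2 are mutually close, as are items 3-4.
dist = [
    [0.0, 0.4, 0.2, 1.0, 1.0],
    [0.4, 0.0, 0.2, 1.0, 1.0],
    [0.2, 0.2, 0.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 0.0, 0.3],
    [1.0, 1.0, 1.0, 0.3, 0.0],
]
print(butina_clusters(dist, cutoff=0.35))  # [[2, 0, 1], [3, 4]]
print(butina_clusters(dist, cutoff=0.1))   # [[0], [1], [2], [3], [4]]
```

Building the neighbour lists compares every pair of items, which is where the quadratic run time comes from.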
- k_fold_split(dataset: deepchem.data.datasets.Dataset, k: int, directories: Optional[List[str]] = None, **kwargs) List[Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset]] [source]¶
- Parameters
dataset (Dataset) – Dataset to split into k folds.
k (int) – Number of folds to split dataset into.
directories (List[str], optional (default None)) – List of 2*k file paths in which to save the resulting disk datasets.
- Returns
List of k tuples (train, cv), where train and cv are both Dataset objects.
- Return type
- train_test_split(dataset: deepchem.data.datasets.Dataset, train_dir: Optional[str] = None, test_dir: Optional[str] = None, frac_train: float = 0.8, seed: Optional[int] = None, **kwargs) Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset] [source]¶
Splits self into train/test sets.
Returns Dataset objects for train/test.
- Parameters
dataset (data like object) – Dataset to be split.
train_dir (str, optional (default None)) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
test_dir (str, optional (default None)) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
seed (int, optional (default None)) – Random seed to use.
- Returns
A tuple of train and test datasets as dc.data.Dataset objects.
- Return type
- train_valid_test_split(dataset: deepchem.data.datasets.Dataset, train_dir: Optional[str] = None, valid_dir: Optional[str] = None, test_dir: Optional[str] = None, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: Optional[int] = None, log_every_n: int = 1000, **kwargs) Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset] [source]¶
Splits self into train/validation/test sets.
Returns Dataset objects for train, valid, test.
- Parameters
dataset (Dataset) – Dataset to be split.
train_dir (str, optional (default None)) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
valid_dir (str, optional (default None)) – If specified, the directory in which the generated valid dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
test_dir (str, optional (default None)) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
seed (int, optional (default None)) – Random seed to use.
log_every_n (int, optional (default 1000)) – Controls the logger by dictating how often logger outputs will be produced.
- Returns
A tuple of train, valid and test datasets as dc.data.Dataset objects.
- Return type
FingerprintSplitter¶
- class FingerprintSplitter[source]¶
Class for doing data splits based on the Tanimoto similarity between ECFP4 fingerprints.
This class tries to split the data such that the molecules in each dataset are as different as possible from the ones in the other datasets. This makes it a very stringent test of models. Predicting the test and validation sets may require extrapolating far outside the training data.
The running time for this splitter scales as O(n^2) in the number of samples. Splitting large datasets can take a long time.
Note
This class requires RDKit to be installed.
Examples
>>> import deepchem as dc
>>> import numpy as np
>>> # creation of demo data set with some SMILES strings
>>> smiles = ['C', 'CC', 'CCC', 'CCCC', 'CCCCC']
>>> Xs = np.zeros(len(smiles))
>>> # creation of a deepchem dataset with the SMILES strings in the ids field
>>> dataset = dc.data.DiskDataset.from_numpy(X=Xs, ids=smiles)
>>> fingerprintsplitter = dc.splits.FingerprintSplitter()
>>> train_dataset, test_dataset = fingerprintsplitter.train_test_split(dataset)
>>> print(train_dataset.ids)
['C' 'CCCCC' 'CCCC' 'CCC']
>>> print(test_dataset.ids)
['CC']
- split(dataset: deepchem.data.datasets.Dataset, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: Optional[int] = None, log_every_n: Optional[int] = None) Tuple[List[int], List[int], List[int]] [source]¶
Splits compounds into training, validation, and test sets based on the Tanimoto similarity of their ECFP4 fingerprints. This splitting algorithm has an O(N^2) run time, where N is the number of elements in the dataset.
- Parameters
dataset (Dataset) – Dataset to be split.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
seed (int, optional (default None)) – Random seed to use (ignored since this algorithm is deterministic).
log_every_n (int, optional (default None)) – Log every n examples (not currently used).
- Returns
A tuple of train indices, valid indices, and test indices.
- Return type
Tuple[List[int], List[int], List[int]]
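For readers unfamiliar with the metric, Tanimoto similarity between two binary fingerprints is the number of shared on-bits divided by the number of bits set in either. The sketch below uses hypothetical sets of bit positions in place of real ECFP4 fingerprints, and the value returned for two empty fingerprints is a convention chosen here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not bits_a and not bits_b:
        return 1.0  # convention chosen here for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


# Hypothetical on-bit positions standing in for ECFP4 fingerprints.
fp_a = {1, 2, 3, 4}
fp_b = {1, 2, 3, 5}
fp_c = {10, 11}
print(tanimoto(fp_a, fp_b))  # 0.6
print(tanimoto(fp_a, fp_c))  # 0.0
```

Comparing every molecule against every other one in this way requires N*(N-1)/2 similarity evaluations, consistent with the O(N^2) run time noted above.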
- __repr__() str [source]¶
Convert self to repr representation.
- Returns
A string that represents the class.
- Return type
str
Examples
>>> import deepchem as dc
>>> dc.splits.RandomSplitter()
RandomSplitter[]
- __str__() str [source]¶
Convert self to str representation.
- Returns
A string that represents the class.
- Return type
str
Examples
>>> import deepchem as dc
>>> str(dc.splits.RandomSplitter())
'RandomSplitter'
- k_fold_split(dataset: deepchem.data.datasets.Dataset, k: int, directories: Optional[List[str]] = None, **kwargs) List[Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset]] [source]¶
- Parameters
dataset (Dataset) – Dataset to split into k folds.
k (int) – Number of folds to split dataset into.
directories (List[str], optional (default None)) – List of 2*k file paths in which to save the resulting disk datasets.
- Returns
List of k tuples (train, cv), where train and cv are both Dataset objects.
- Return type
- train_test_split(dataset: deepchem.data.datasets.Dataset, train_dir: Optional[str] = None, test_dir: Optional[str] = None, frac_train: float = 0.8, seed: Optional[int] = None, **kwargs) Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset] [source]¶
Splits self into train/test sets.
Returns Dataset objects for train/test.
- Parameters
dataset (data like object) – Dataset to be split.
train_dir (str, optional (default None)) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
test_dir (str, optional (default None)) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
seed (int, optional (default None)) – Random seed to use.
- Returns
A tuple of train and test datasets as dc.data.Dataset objects.
- Return type
Tuple[Dataset, Dataset]
- train_valid_test_split(dataset: deepchem.data.datasets.Dataset, train_dir: Optional[str] = None, valid_dir: Optional[str] = None, test_dir: Optional[str] = None, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: Optional[int] = None, log_every_n: int = 1000, **kwargs) Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset] [source]¶
Splits the dataset into train/validation/test sets.
Returns Dataset objects for train, valid, test.
- Parameters
dataset (Dataset) – Dataset to be split.
train_dir (str, optional (default None)) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
valid_dir (str, optional (default None)) – If specified, the directory in which the generated valid dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
test_dir (str, optional (default None)) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
seed (int, optional (default None)) – Random seed to use.
log_every_n (int, optional (default 1000)) – Controls the logger by dictating how often logger outputs will be produced.
- Returns
A tuple of train, valid and test datasets as dc.data.Dataset objects.
- Return type
Tuple[Dataset, Dataset, Dataset]
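How frac_train, frac_valid, frac_test and seed interact can be sketched with index lists in plain Python. This is a minimal illustration under stated assumptions, not the library's code: the helper random_split_indices is hypothetical, and it returns indices rather than the Dataset objects that train_valid_test_split returns.

```python
import random
from typing import List, Optional, Tuple


def random_split_indices(
    n: int,
    frac_train: float = 0.8,
    frac_valid: float = 0.1,
    frac_test: float = 0.1,
    seed: Optional[int] = None,
) -> Tuple[List[int], List[int], List[int]]:
    """Hypothetical helper: shuffle range(n) and cut at the fraction boundaries."""
    if abs(frac_train + frac_valid + frac_test - 1.0) > 1e-8:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n))
    # A seeded generator makes the shuffle reproducible.
    random.Random(seed).shuffle(indices)
    train_cutoff = int(frac_train * n)
    valid_cutoff = int((frac_train + frac_valid) * n)
    return (indices[:train_cutoff],
            indices[train_cutoff:valid_cutoff],
            indices[valid_cutoff:])


train, valid, test = random_split_indices(100, seed=0)
print(len(train), len(valid), len(test))
```

Calling the helper twice with the same seed yields the same three index lists, which is the reason for passing a seed when a split must be reproducible.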
Base Splitter (for develop)¶
The dc.splits.Splitter
class is the abstract parent class for
all splitters. This class should never be directly instantiated.
- class Splitter[source]¶
Splitters split up Datasets into pieces for training/validation/testing.
In machine learning applications, it’s often necessary to split up a dataset into training/validation/test sets, or to k-fold split a dataset (that is, divide it into k equal subsets) for cross-validation. The Splitter class is an abstract superclass for all splitters that captures the common API across splitter classes.
Note that Splitter is an abstract superclass. You won’t want to instantiate this class directly. Rather you will want to use a concrete subclass for your application.
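The abstract-superclass pattern described above can be sketched in plain Python. Everything here is a stand-in: BaseSplitter and InOrderSplitter are hypothetical names, and the use of Python's abc module is an assumption made for illustration, so DeepChem's own Splitter may enforce the contract differently. The point is that the base class holds the shared logic while each concrete subclass supplies split.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence, Tuple

Indices = Tuple[List[int], List[int], List[int]]


class BaseSplitter(ABC):
    """Stand-in for an abstract splitter; not DeepChem's Splitter class."""

    @abstractmethod
    def split(self, data: Sequence, frac_train: float = 0.8,
              frac_valid: float = 0.1, frac_test: float = 0.1) -> Indices:
        """Return (train, valid, test) index lists."""

    def train_valid_test_split(self, data: Sequence, **kwargs):
        # Shared logic lives in the base class and relies on split().
        splits = self.split(data, **kwargs)
        return tuple([data[i] for i in inds] for inds in splits)


class InOrderSplitter(BaseSplitter):
    """Concrete subclass: keeps the original order and cuts by fraction."""

    def split(self, data, frac_train=0.8, frac_valid=0.1, frac_test=0.1):
        n = len(data)
        train_cutoff = int(frac_train * n)
        valid_cutoff = int((frac_train + frac_valid) * n)
        indices = list(range(n))
        return (indices[:train_cutoff],
                indices[train_cutoff:valid_cutoff],
                indices[valid_cutoff:])


try:
    BaseSplitter()  # the abstract stand-in cannot be instantiated
except TypeError as err:
    print("cannot instantiate:", err)

train, valid, test = InOrderSplitter().train_valid_test_split(list("abcdefghij"))
print(train, valid, test)
```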
- k_fold_split(dataset: deepchem.data.datasets.Dataset, k: int, directories: Optional[List[str]] = None, **kwargs) List[Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset]] [source]¶
- Parameters
dataset (Dataset) – Dataset on which to perform the k-fold split.
k (int) – Number of folds to split dataset into.
directories (List[str], optional (default None)) – List of 2*k file paths in which to save the resulting disk datasets.
- Returns
List of k (train, cv) tuples, where train and cv are both Dataset objects.
- Return type
List[Tuple[Dataset, Dataset]]
- train_valid_test_split(dataset: deepchem.data.datasets.Dataset, train_dir: Optional[str] = None, valid_dir: Optional[str] = None, test_dir: Optional[str] = None, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: Optional[int] = None, log_every_n: int = 1000, **kwargs) Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset] [source]¶
Splits the dataset into train/validation/test sets.
Returns Dataset objects for train, valid, test.
- Parameters
dataset (Dataset) – Dataset to be split.
train_dir (str, optional (default None)) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
valid_dir (str, optional (default None)) – If specified, the directory in which the generated valid dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
test_dir (str, optional (default None)) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
seed (int, optional (default None)) – Random seed to use.
log_every_n (int, optional (default 1000)) – Controls the logger by dictating how often logger outputs will be produced.
- Returns
A tuple of train, valid and test datasets as dc.data.Dataset objects.
- Return type
Tuple[Dataset, Dataset, Dataset]
- train_test_split(dataset: deepchem.data.datasets.Dataset, train_dir: Optional[str] = None, test_dir: Optional[str] = None, frac_train: float = 0.8, seed: Optional[int] = None, **kwargs) Tuple[deepchem.data.datasets.Dataset, deepchem.data.datasets.Dataset] [source]¶
Splits the dataset into train/test sets.
Returns Dataset objects for train/test.
- Parameters
dataset (Dataset) – Dataset to be split.
train_dir (str, optional (default None)) – If specified, the directory in which the generated training dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
test_dir (str, optional (default None)) – If specified, the directory in which the generated test dataset should be stored. This is only considered if isinstance(dataset, dc.data.DiskDataset) is True.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
seed (int, optional (default None)) – Random seed to use.
- Returns
A tuple of train and test datasets as dc.data.Dataset objects.
- Return type
Tuple[Dataset, Dataset]
- split(dataset: deepchem.data.datasets.Dataset, frac_train: float = 0.8, frac_valid: float = 0.1, frac_test: float = 0.1, seed: Optional[int] = None, log_every_n: Optional[int] = None) Tuple [source]¶
Returns the indices for the specified split.
- Parameters
dataset (dc.data.Dataset) – Dataset to be split.
seed (int, optional (default None)) – Random seed to use.
frac_train (float, optional (default 0.8)) – The fraction of data to be used for the training split.
frac_valid (float, optional (default 0.1)) – The fraction of data to be used for the validation split.
frac_test (float, optional (default 0.1)) – The fraction of data to be used for the test split.
log_every_n (int, optional (default None)) – Controls the logger by dictating how often logger outputs will be produced.
- Returns
A tuple (train_inds, valid_inds, test_inds) of the indices (integers) for the various splits.
- Return type
Tuple
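A short sketch of how a (train_inds, valid_inds, test_inds) tuple might be consumed, using plain Python lists in place of a Dataset. The helper take and the sample values are hypothetical; with a real Dataset the selection step would be done through the dataset's own API rather than list indexing.

```python
from typing import List, Sequence


def take(rows: Sequence, inds: Sequence[int]) -> List:
    """Hypothetical helper: pick the rows at the given integer indices."""
    return [rows[i] for i in inds]


X = [[0.1], [0.2], [0.3], [0.4], [0.5]]
y = [0, 1, 0, 1, 1]

# Example index tuple in the (train_inds, valid_inds, test_inds) layout.
train_inds, valid_inds, test_inds = [3, 0, 4], [1], [2]

# Every sample should land in exactly one split.
all_inds = sorted(train_inds + valid_inds + test_inds)
assert all_inds == list(range(len(X)))

X_train, y_train = take(X, train_inds), take(y, train_inds)
X_test, y_test = take(X, test_inds), take(y, test_inds)
print(X_train, y_train)
```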