Data

DeepChem dc.data provides APIs for handling your data.

If your data is stored in files such as CSV or SDF, you can use the Data Loaders. The Data Loaders read your data, convert it to features (for example, SMILES to ECFP), and save the features in a Dataset object. If your data consists of Python objects like NumPy arrays or Pandas DataFrames, you can use the Datasets directly.
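
As a quick sketch of both paths (the CSV file name and column names below are hypothetical placeholders, and the loading line is commented out because the file does not exist here):

>>> import numpy as np
>>> import deepchem as dc
>>> # In-memory path: wrap NumPy arrays directly in a Dataset.
>>> dataset = dc.data.NumpyDataset(X=np.random.rand(4, 16), y=np.random.rand(4, 1))
>>> len(dataset)
4
>>> # File path: featurize a CSV file through a Data Loader.
>>> loader = dc.data.CSVLoader(tasks=["task1"], feature_field="smiles",
...                            featurizer=dc.feat.CircularFingerprint())
>>> # disk_dataset = loader.create_dataset("my_data.csv")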

Datasets

DeepChem dc.data.Dataset objects are one of the core building blocks of DeepChem programs. Dataset objects hold representations of data for machine learning and are widely used throughout DeepChem.

The goal of the Dataset class is to be maximally interoperable with other common representations of machine learning datasets. For this reason we provide interconversion methods mapping from Dataset objects to pandas DataFrames, TensorFlow Datasets, and PyTorch datasets.
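
For example, a minimal sketch of these interconversions (the TensorFlow and PyTorch conversions are commented out because they require those libraries to be installed):

>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.random.rand(4, 3), y=np.random.rand(4, 1))
>>> df = dataset.to_dataframe()  # pandas DataFrame with feature, label, weight and id columns
>>> # tf_ds = dataset.make_tf_dataset(batch_size=2)          # requires TensorFlow
>>> # torch_ds = dataset.make_pytorch_dataset(batch_size=2)  # requires PyTorch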

NumpyDataset

The dc.data.NumpyDataset class provides an in-memory implementation of the abstract Dataset which stores its data in numpy.ndarray objects.

class NumpyDataset(X: ArrayLike, y: ArrayLike | None = None, w: ArrayLike | None = None, ids: ArrayLike | None = None, n_tasks: int = 1)[source]

A Dataset defined by in-memory numpy arrays.

This subclass of Dataset stores arrays X,y,w,ids in memory as numpy arrays. This makes it very easy to construct NumpyDataset objects.

Examples

>>> import numpy as np
>>> dataset = NumpyDataset(X=np.random.rand(5, 3), y=np.random.rand(5,), ids=np.arange(5))
__init__(X: ArrayLike, y: ArrayLike | None = None, w: ArrayLike | None = None, ids: ArrayLike | None = None, n_tasks: int = 1) None[source]

Initialize this object.

Parameters:
  • X (np.ndarray) – Input features. A numpy array of shape (n_samples,…).

  • y (np.ndarray, optional (default None)) – Labels. A numpy array of shape (n_samples, …). Note that each label can have an arbitrary shape.

  • w (np.ndarray, optional (default None)) – Weights. Should either be 1D array of shape (n_samples,) or if there’s more than one task, of shape (n_samples, n_tasks).

  • ids (np.ndarray, optional (default None)) – Identifiers. A numpy array of shape (n_samples,)

  • n_tasks (int, default 1) – Number of learning tasks.

__len__() int[source]

Get the number of elements in the dataset.

get_shape() Tuple[Tuple[int, ...], Tuple[int, ...], Tuple[int, ...], Tuple[int, ...]][source]

Get the shape of the dataset.

Returns four tuples, giving the shape of the X, y, w, and ids arrays.

get_task_names() ndarray[source]

Get the names of the tasks associated with this dataset.

property X: ndarray[source]

Get the X vector for this dataset as a single numpy array.

property y: ndarray[source]

Get the y vector for this dataset as a single numpy array.

property ids: ndarray[source]

Get the ids vector for this dataset as a single numpy array.

property w: ndarray[source]

Get the weight vector for this dataset as a single numpy array.

iterbatches(batch_size: int | None = None, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False) Iterator[Tuple[ndarray, ndarray, ndarray, ndarray]][source]

Get an object that iterates over minibatches from the dataset.

Each minibatch is returned as a tuple of four numpy arrays: (X, y, w, ids).

Parameters:
  • batch_size (int, optional (default None)) – Number of elements in each batch.

  • epochs (int, default 1) – Number of epochs to walk over dataset.

  • deterministic (bool, optional (default False)) – If True, follow deterministic order.

  • pad_batches (bool, optional (default False)) – If True, pad each batch to batch_size.

Returns:

Generator which yields tuples of four numpy arrays (X, y, w, ids).

Return type:

Iterator[Batch]
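
A brief usage sketch (with pad_batches=True the final partial batch is padded up to batch_size):

>>> import numpy as np
>>> dataset = NumpyDataset(X=np.random.rand(10, 3), y=np.random.rand(10, 1))
>>> for X_b, y_b, w_b, ids_b in dataset.iterbatches(batch_size=4, pad_batches=True):
...   print(X_b.shape)
(4, 3)
(4, 3)
(4, 3)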

itersamples() Iterator[Tuple[ndarray, ndarray, ndarray, ndarray]][source]

Get an object that iterates over the samples in the dataset.

Returns:

Iterator which yields tuples of four numpy arrays (X, y, w, ids).

Return type:

Iterator[Batch]

Examples

>>> dataset = NumpyDataset(np.ones((2,2)))
>>> for x, y, w, id in dataset.itersamples():
...   print(x.tolist(), y.tolist(), w.tolist(), id)
[1.0, 1.0] [0.0] [0.0] 0
[1.0, 1.0] [0.0] [0.0] 1
transform(transformer: Transformer, **args) NumpyDataset[source]

Construct a new dataset by applying a transformation to every sample in this dataset.

The argument is a function that can be called as follows:

>> newx, newy, neww = fn(x, y, w)

It might be called only once with the whole dataset, or multiple times with different subsets of the data. Each time it is called, it should transform the samples and return the transformed data.

Parameters:

transformer (dc.trans.Transformer) – The transformation to apply to each sample in the dataset

Returns:

A newly constructed NumpyDataset object

Return type:

NumpyDataset
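
A minimal sketch using dc.trans.NormalizationTransformer to standardize the labels:

>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.random.rand(10, 3), y=np.random.rand(10, 1))
>>> transformer = dc.trans.NormalizationTransformer(transform_y=True, dataset=dataset)
>>> normalized = dataset.transform(transformer)
>>> len(normalized) == len(dataset)
True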

select(indices: Sequence[int] | ndarray, select_dir: str | None = None) NumpyDataset[source]

Creates a new dataset from a selection of indices from self.

Parameters:
  • indices (List[int]) – List of indices to select.

  • select_dir (str, optional (default None)) – Used to provide same API as DiskDataset. Ignored since NumpyDataset is purely in-memory.

Returns:

A selected NumpyDataset object

Return type:

NumpyDataset
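
A brief sketch:

>>> import numpy as np
>>> dataset = NumpyDataset(X=np.arange(12).reshape(6, 2))
>>> selected = dataset.select([0, 2, 4])
>>> len(selected)
3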

make_pytorch_dataset(epochs: int = 1, deterministic: bool = False, batch_size: int | None = None)[source]

Create a torch.utils.data.IterableDataset that iterates over the data in this Dataset.

Each value returned by the Dataset’s iterator is a tuple of (X, y, w, id) containing the data for one batch, or for a single sample if batch_size is None.

Parameters:
  • epochs (int, default 1) – The number of times to iterate over the Dataset

  • deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.

  • batch_size (int, optional (default None)) – The number of samples to return in each batch. If None, each returned value is a single sample.

Returns:

torch.utils.data.IterableDataset that iterates over the data in this dataset.

Return type:

torch.utils.data.IterableDataset

Note

This method requires PyTorch to be installed.
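
A brief sketch, assuming PyTorch is installed (each element of the iterable dataset is one (X, y, w, ids) batch):

>>> import numpy as np
>>> dataset = NumpyDataset(X=np.random.rand(8, 3), y=np.random.rand(8, 1))
>>> torch_ds = dataset.make_pytorch_dataset(epochs=1, batch_size=4)
>>> batches = [X_b for (X_b, y_b, w_b, ids_b) in torch_ds]
>>> len(batches)
2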

static from_DiskDataset(ds: DiskDataset) NumpyDataset[source]

Convert DiskDataset to NumpyDataset.

Parameters:

ds (DiskDataset) – DiskDataset to transform to NumpyDataset.

Returns:

A new NumpyDataset created from DiskDataset.

Return type:

NumpyDataset

to_json(fname: str) None[source]

Dump NumpyDataset to a JSON file.

Parameters:

fname (str) – The name of the json file.

static from_json(fname: str) NumpyDataset[source]

Create NumpyDataset from a JSON file.

Parameters:

fname (str) – The name of the json file.

Returns:

A new NumpyDataset created from the json file.

Return type:

NumpyDataset
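
A brief round-trip sketch (the file name 'dataset.json' is a placeholder written to the working directory):

>>> import numpy as np
>>> dataset = NumpyDataset(X=np.random.rand(4, 3), y=np.random.rand(4, 1))
>>> dataset.to_json('dataset.json')
>>> restored = NumpyDataset.from_json('dataset.json')
>>> len(restored) == len(dataset)
True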

static merge(datasets: Sequence[Dataset]) NumpyDataset[source]

Merge multiple NumpyDatasets.

Parameters:

datasets (List[Dataset]) – List of datasets to merge.

Returns:

A single NumpyDataset containing all the samples from all datasets.

Return type:

NumpyDataset

Example

>>> import numpy as np
>>> import deepchem as dc
>>> X1, y1 = np.random.rand(5, 3), np.random.randn(5, 1)
>>> first_dataset = dc.data.NumpyDataset(X1, y1)
>>> X2, y2 = np.random.rand(5, 3), np.random.randn(5, 1)
>>> second_dataset = dc.data.NumpyDataset(X2, y2)
>>> merged_dataset = dc.data.NumpyDataset.merge([first_dataset, second_dataset])
>>> print(len(merged_dataset) == len(first_dataset) + len(second_dataset))
True
static from_dataframe(df: DataFrame, X: str | Sequence[str] | None = None, y: str | Sequence[str] | None = None, w: str | Sequence[str] | None = None, ids: str | None = None)[source]

Construct a Dataset from the contents of a pandas DataFrame.

Parameters:
  • df (pd.DataFrame) – The pandas DataFrame

  • X (str or List[str], optional (default None)) – The name of the column or columns containing the X array. If this is None, it will look for default column names that match those produced by to_dataframe().

  • y (str or List[str], optional (default None)) – The name of the column or columns containing the y array. If this is None, it will look for default column names that match those produced by to_dataframe().

  • w (str or List[str], optional (default None)) – The name of the column or columns containing the w array. If this is None, it will look for default column names that match those produced by to_dataframe().

  • ids (str, optional (default None)) – The name of the column containing the ids. If this is None, it will look for default column names that match those produced by to_dataframe().
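
A minimal sketch with explicitly named columns:

>>> import pandas as pd
>>> df = pd.DataFrame({'X1': [0.1, 0.2], 'X2': [0.3, 0.4],
...                    'y': [1.0, 0.0], 'w': [1.0, 1.0], 'ids': ['a', 'b']})
>>> dataset = NumpyDataset.from_dataframe(df, X=['X1', 'X2'], y='y', w='w', ids='ids')
>>> dataset.X.shape
(2, 2)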

get_statistics(X_stats: bool = True, y_stats: bool = True) Tuple[ndarray, ...][source]

Compute and return statistics of this dataset.

Uses self.itersamples() to compute means and standard deviations of the dataset. Can compute on large datasets that don’t fit in memory.

Parameters:
  • X_stats (bool, optional (default True)) – If True, compute feature-level mean and standard deviations.

  • y_stats (bool, optional (default True)) – If True, compute label-level mean and standard deviations.

Returns:

  • If X_stats == True, returns (X_means, X_stds).

  • If y_stats == True, returns (y_means, y_stds).

  • If both are true, returns (X_means, X_stds, y_means, y_stds).

Return type:

Tuple
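
A brief sketch (with both flags True, a 4-tuple of means and standard deviations is returned):

>>> import numpy as np
>>> dataset = NumpyDataset(X=np.random.rand(100, 3), y=np.random.rand(100, 1))
>>> stats = dataset.get_statistics(X_stats=True, y_stats=True)
>>> len(stats)
4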

make_tf_dataset(batch_size: int = 100, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False)[source]

Create a tf.data.Dataset that iterates over the data in this Dataset.

Each value returned by the Dataset’s iterator is a tuple of (X, y, w) for one batch.

Parameters:
  • batch_size (int, default 100) – The number of samples to include in each batch.

  • epochs (int, default 1) – The number of times to iterate over the Dataset.

  • deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.

  • pad_batches (bool, default False) – If True, batches are padded as necessary to make the size of each batch exactly equal batch_size.

Returns:

TensorFlow Dataset that iterates over the same data.

Return type:

tf.data.Dataset

Note

This class requires TensorFlow to be installed.
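
A brief sketch, assuming TensorFlow is installed (each element of the returned dataset is one (X, y, w) batch):

>>> import numpy as np
>>> dataset = NumpyDataset(X=np.random.rand(10, 3), y=np.random.rand(10, 1))
>>> tf_ds = dataset.make_tf_dataset(batch_size=5, epochs=1)
>>> batches = [X_b for (X_b, y_b, w_b) in tf_ds]
>>> len(batches)
2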

to_csv(path: str) None[source]

Write object to a comma-separated values (CSV) file.

Example

>>> import numpy as np
>>> import deepchem as dc
>>> X = np.random.rand(10, 10)
>>> dataset = dc.data.DiskDataset.from_numpy(X)
>>> dataset.to_csv('out.csv')

Parameters:

path (str) – File path or object

Return type:

None

to_dataframe() DataFrame[source]

Construct a pandas DataFrame containing the data from this Dataset.

Returns:

Pandas dataframe. If there is only a single feature per datapoint, will have column “X” else will have columns “X1,X2,…” for features. If there is only a single label per datapoint, will have column “y” else will have columns “y1,y2,…” for labels. If there is only a single weight per datapoint will have column “w” else will have columns “w1,w2,…”. Will have column “ids” for identifiers.

Return type:

pd.DataFrame

DiskDataset

The dc.data.DiskDataset class allows for the storage of larger datasets on disk. Each DiskDataset is associated with a directory in which it writes its contents to disk. Note that a DiskDataset can be very large, so some of the utility methods to access fields of a Dataset can be prohibitively expensive.

class DiskDataset(data_dir: str)[source]

A Dataset that is stored as a set of files on disk.

The DiskDataset is the workhorse class of DeepChem that facilitates analyses on large datasets. Use this class whenever you’re working with a large dataset that can’t be easily manipulated in RAM.

On disk, a DiskDataset has a simple structure. All files for a given DiskDataset are stored in a data_dir. The contents of data_dir should be laid out as follows:

data_dir/
—> metadata.csv.gzip
—> tasks.json
—> shard-0-X.npy
—> shard-0-y.npy
—> shard-0-w.npy
—> shard-0-ids.npy
—> shard-1-X.npy
.
.
.

The metadata is constructed by static method DiskDataset._construct_metadata and saved to disk by DiskDataset._save_metadata. The metadata itself consists of a csv file which has columns (‘ids’, ‘X’, ‘y’, ‘w’, ‘ids_shape’, ‘X_shape’, ‘y_shape’, ‘w_shape’). tasks.json consists of a list of task names for this dataset.

The actual data is stored in .npy files (numpy array files) of the form ‘shard-0-X.npy’, ‘shard-0-y.npy’, etc.

The basic structure of DiskDataset is quite robust and will likely serve you well for datasets up to about 100 GB. Note, however, that DiskDataset has not been tested for very large datasets in the terabyte range and beyond. You may be better served by implementing a custom Dataset class for those use cases.

Examples

Let’s walk through a simple example of constructing a new DiskDataset.

>>> import deepchem as dc
>>> import numpy as np
>>> X = np.random.rand(10, 10)
>>> dataset = dc.data.DiskDataset.from_numpy(X)

If you have already saved a DiskDataset to data_dir, you can reinitialize it with

>> data_dir = "/path/to/my/data"
>> dataset = dc.data.DiskDataset(data_dir)

Once you have a dataset you can access its attributes as follows

>>> X = np.random.rand(10, 10)
>>> y = np.random.rand(10,)
>>> w = np.ones_like(y)
>>> dataset = dc.data.DiskDataset.from_numpy(X)
>>> X, y, w = dataset.X, dataset.y, dataset.w

One thing to beware of is that dataset.X, dataset.y, dataset.w are loading data from disk! If you have a large dataset, these operations can be extremely slow. Instead, try iterating through the dataset.

>>> for (xi, yi, wi, idi) in dataset.itersamples():
...   pass
data_dir[source]

Location of directory where this DiskDataset is stored to disk

Type:

str

metadata_df[source]

Pandas Dataframe holding metadata for this DiskDataset

Type:

pd.DataFrame

legacy_metadata[source]

Whether this DiskDataset uses legacy format.

Type:

bool

Note

DiskDataset originally had a simpler metadata format without shape information. Older DiskDataset objects had metadata files with columns (‘ids’, ‘X’, ‘y’, ‘w’) and no additional shape columns. DiskDataset maintains backwards compatibility with this older metadata format, but for performance reasons we recommend not using legacy metadata for new projects.

__init__(data_dir: str) None[source]

Load a constructed DiskDataset from disk

Note that this method cannot construct a new disk dataset. Use the static methods DiskDataset.create_dataset or DiskDataset.from_numpy for that purpose. Use this constructor to load a DiskDataset that has already been created on disk.

Parameters:

data_dir (str) – Location on disk of an existing DiskDataset.

static create_dataset(shard_generator: Iterable[Tuple[ndarray, ndarray, ndarray, ndarray]], data_dir: str | None = None, tasks: ArrayLike | None = None) DiskDataset[source]

Creates a new DiskDataset

Parameters:
  • shard_generator (Iterable[Batch]) – An iterable (either a list or generator) that provides tuples of data (X, y, w, ids). Each tuple will be written to a separate shard on disk.

  • data_dir (str, optional (default None)) – Filename for data directory. Creates a temp directory if none specified.

  • tasks (Sequence, optional (default [])) – List of tasks for this dataset.

Returns:

A new DiskDataset constructed from the given data

Return type:

DiskDataset
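
A minimal sketch using a generator of shards (the dataset is written to a temporary directory since data_dir is omitted):

>>> import numpy as np
>>> import deepchem as dc
>>> def shard_generator():
...   for shard_num in range(3):
...     X = np.random.rand(10, 5)
...     y = np.random.rand(10, 1)
...     w = np.ones((10, 1))
...     ids = np.arange(shard_num * 10, (shard_num + 1) * 10)
...     yield X, y, w, ids
>>> dataset = dc.data.DiskDataset.create_dataset(shard_generator(), tasks=['task1'])
>>> dataset.get_number_shards()
3
>>> len(dataset)
30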

load_metadata() Tuple[List[str], DataFrame][source]

Helper method that loads metadata from disk.

static write_data_to_disk(data_dir: str, basename: str, X: ndarray | None = None, y: ndarray | None = None, w: ndarray | None = None, ids: ndarray | None = None) List[Any][source]

Static helper method to write data to disk.

This helper method is used to write a shard of data to disk.

Parameters:
  • data_dir (str) – Data directory to write shard to.

  • basename (str) – Basename for the shard in question.

  • X (np.ndarray, optional (default None)) – The features array.

  • y (np.ndarray, optional (default None)) – The labels array.

  • w (np.ndarray, optional (default None)) – The weights array.

  • ids (np.ndarray, optional (default None)) – The identifiers array.

Returns:

List with values [out_ids, out_X, out_y, out_w, out_ids_shape, out_X_shape, out_y_shape, out_w_shape] with filenames of locations to disk which these respective arrays were written.

Return type:

List[Optional[str]]

save_to_disk() None[source]

Save dataset to disk.

move(new_data_dir: str, delete_if_exists: bool | None = True) None[source]

Moves dataset to new directory.

Parameters:
  • new_data_dir (str) – The new directory name to move this dataset to.

  • delete_if_exists (bool, optional (default True)) – If this option is set, delete the destination directory if it exists before moving. This is set to True by default to be backwards compatible with behavior in earlier versions of DeepChem.

Note

This is a stateful operation! self.data_dir will be moved into new_data_dir. If delete_if_exists is set to True (by default this is set True), then new_data_dir is deleted if it’s a pre-existing directory.

copy(new_data_dir: str) DiskDataset[source]

Copies dataset to new directory.

Parameters:

new_data_dir (str) – The new directory name to copy this dataset to.

Returns:

A copied DiskDataset object.

Return type:

DiskDataset

Note

This is a stateful operation! Any data at new_data_dir will be deleted and self.data_dir will be deep copied into new_data_dir.

get_task_names() ndarray[source]

Gets learning tasks associated with this dataset.

reshard(shard_size: int) None[source]

Reshards data to have specified shard size.

Parameters:

shard_size (int) – The size of shard.

Examples

>>> import deepchem as dc
>>> import numpy as np
>>> X = np.random.rand(100, 10)
>>> d = dc.data.DiskDataset.from_numpy(X)
>>> d.reshard(shard_size=10)
>>> d.get_number_shards()
10

Note

If this DiskDataset is in legacy_metadata format, reshard will convert this dataset to have non-legacy metadata.

get_data_shape() Tuple[int, ...][source]

Gets array shape of datapoints in this dataset.

get_shard_size() int[source]

Gets size of shards on disk.

get_number_shards() int[source]

Returns the number of shards for this dataset.

itershards() Iterator[Tuple[ndarray, ndarray, ndarray, ndarray]][source]

Return an object that iterates over all shards in dataset.

Datasets are stored in sharded fashion on disk. Each call to next() for the generator defined by this function returns the data from a particular shard. The order of shards returned is guaranteed to remain fixed.

Returns:

Generator which yields tuples of four numpy arrays (X, y, w, ids).

Return type:

Iterator[Batch]
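
A brief sketch:

>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.DiskDataset.from_numpy(np.random.rand(100, 10))
>>> dataset.reshard(shard_size=25)
>>> shard_sizes = [len(X_s) for (X_s, y_s, w_s, ids_s) in dataset.itershards()]
>>> shard_sizes
[25, 25, 25, 25]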

iterbatches(batch_size: int | None = None, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False) Iterator[Tuple[ndarray, ndarray, ndarray, ndarray]][source]

Get an object that iterates over minibatches from the dataset.

It is guaranteed that the number of batches returned is math.ceil(len(dataset)/batch_size). Each minibatch is returned as a tuple of four numpy arrays: (X, y, w, ids).

Parameters:
  • batch_size (int, optional (default None)) – Number of elements in a batch. If None, then it yields batches with size equal to the size of each individual shard.

  • epochs (int, default 1) – Number of epochs to walk over dataset.

  • deterministic (bool, default False) – If False, each shard is shuffled before its batches are generated. Note that this shuffling is only local: samples are never mixed between different shards.

  • pad_batches (bool, default False) – Whether or not we should pad the last batch, globally, such that it has exactly batch_size elements.

Returns:

Generator which yields tuples of four numpy arrays (X, y, w, ids).

Return type:

Iterator[Batch]

itersamples() Iterator[Tuple[ndarray, ndarray, ndarray, ndarray]][source]

Get an object that iterates over the samples in the dataset.

Returns:

Generator which yields tuples of four numpy arrays (X, y, w, ids).

Return type:

Iterator[Batch]

Examples

>>> dataset = DiskDataset.from_numpy(np.ones((2,2)), np.ones((2,1)))
>>> for x, y, w, id in dataset.itersamples():
...   print(x.tolist(), y.tolist(), w.tolist(), id)
[1.0, 1.0] [1.0] [1.0] 0
[1.0, 1.0] [1.0] [1.0] 1
transform(transformer: Transformer, parallel: bool = False, out_dir: str | None = None, **args) DiskDataset[source]

Construct a new dataset by applying a transformation to every sample in this dataset.

The argument is a function that can be called as follows:

>> newx, newy, neww = fn(x, y, w)

It might be called only once with the whole dataset, or multiple times with different subsets of the data. Each time it is called, it should transform the samples and return the transformed data.

Parameters:
  • transformer (dc.trans.Transformer) – The transformation to apply to each sample in the dataset.

  • parallel (bool, default False) – If True, use multiple processes to transform the dataset in parallel.

  • out_dir (str, optional (default None)) – The directory to save the new dataset in. If this is omitted, a temporary directory is created automatically.

Returns:

A newly constructed Dataset object

Return type:

DiskDataset

make_pytorch_dataset(epochs: int = 1, deterministic: bool = False, batch_size: int | None = None)[source]

Create a torch.utils.data.IterableDataset that iterates over the data in this Dataset.

Each value returned by the Dataset’s iterator is a tuple of (X, y, w, id) containing the data for one batch, or for a single sample if batch_size is None.

Parameters:
  • epochs (int, default 1) – The number of times to iterate over the Dataset

  • deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.

  • batch_size (int, optional (default None)) – The number of samples to return in each batch. If None, each returned value is a single sample.

Returns:

torch.utils.data.IterableDataset that iterates over the data in this dataset.

Return type:

torch.utils.data.IterableDataset

Note

This method requires PyTorch to be installed.

static from_numpy(X: ArrayLike, y: ArrayLike | None = None, w: ArrayLike | None = None, ids: ArrayLike | None = None, tasks: ArrayLike | None = None, data_dir: str | None = None) DiskDataset[source]

Creates a DiskDataset object from specified Numpy arrays.

Parameters:
  • X (np.ndarray) – Feature array.

  • y (np.ndarray, optional (default None)) – Labels array.

  • w (np.ndarray, optional (default None)) – Weights array.

  • ids (np.ndarray, optional (default None)) – Identifiers array.

  • tasks (Sequence, optional (default None)) – Tasks in this dataset

  • data_dir (str, optional (default None)) – The directory to write this dataset to. If none is specified, will use a temporary directory instead.

Returns:

A new DiskDataset constructed from the provided information.

Return type:

DiskDataset

static merge(datasets: Iterable[Dataset], merge_dir: str | None = None) DiskDataset[source]

Merges provided datasets into a merged dataset.

Parameters:
  • datasets (Iterable[Dataset]) – List of datasets to merge.

  • merge_dir (str, optional (default None)) – The new directory path to store the merged DiskDataset.

Returns:

A merged DiskDataset.

Return type:

DiskDataset
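
A brief sketch:

>>> import numpy as np
>>> import deepchem as dc
>>> first = dc.data.DiskDataset.from_numpy(np.random.rand(5, 3))
>>> second = dc.data.DiskDataset.from_numpy(np.random.rand(7, 3))
>>> merged = dc.data.DiskDataset.merge([first, second])
>>> len(merged)
12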

subset(shard_nums: Sequence[int], subset_dir: str | None = None) DiskDataset[source]

Creates a subset of the original dataset on disk.

Parameters:
  • shard_nums (Sequence[int]) – The indices of the shards to extract from the original DiskDataset.

  • subset_dir (str, optional (default None)) – The new directory path to store the subset DiskDataset.

Returns:

A subset DiskDataset.

Return type:

DiskDataset
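
A brief sketch:

>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.DiskDataset.from_numpy(np.random.rand(100, 10))
>>> dataset.reshard(shard_size=20)
>>> first_two_shards = dataset.subset([0, 1])
>>> len(first_two_shards)
40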

sparse_shuffle() None[source]

Shuffling that exploits data sparsity to shuffle large datasets.

If feature vectors are sparse, say circular fingerprints or any other representation that contains few nonzero values, it can be possible to exploit the sparsity of the vector to simplify shuffles. This method implements a sparse shuffle by compressing sparse feature vectors down into a compressed representation, then shuffles this compressed dataset in memory and writes the results to disk.

Note

This method only works for 1-dimensional feature vectors (does not work for tensorial featurizations). Note that this shuffle is performed in place.

complete_shuffle(data_dir: str | None = None) Dataset[source]

Completely shuffle across all data, across all shards.

Note

The algorithm used for this complete shuffle is O(N^2) where N is the number of shards. It simply constructs each shard of the output dataset one at a time. Since the complete shuffle can take a long time, it’s useful to watch the logging output. Each shuffled shard is constructed using select() which logs as it selects from each original shard. This will result in O(N^2) logging statements, one for each extraction of shuffled shard i’s contributions from original shard j.

Parameters:

data_dir (Optional[str], (default None)) – Directory to write the shuffled dataset to. If none is specified a temporary directory will be used.

Returns:

A DiskDataset whose data is a randomly shuffled version of this dataset.

Return type:

DiskDataset

shuffle_each_shard(shard_basenames: List[str] | None = None) None[source]

Shuffles elements within each shard of the dataset.

Parameters:

shard_basenames (List[str], optional (default None)) – The basenames for each shard. If this isn’t specified, will assume the basenames of form “shard-i” used by create_dataset and reshard.

shuffle_shards() None[source]

Shuffles the order of the shards for this dataset.

get_shard(i: int) Tuple[ndarray, ndarray, ndarray, ndarray][source]

Retrieves data for the i-th shard from disk.

Parameters:

i (int) – Shard index for shard to retrieve batch from.

Returns:

Batch data for the i-th shard.

Return type:

Batch

get_shard_ids(i: int) ndarray[source]

Retrieves the list of IDs for the i-th shard from disk.

Parameters:

i (int) – Shard index for shard to retrieve ids from.

Returns:

A numpy array of ids for i-th shard.

Return type:

np.ndarray

get_shard_y(i: int) ndarray[source]

Retrieves the labels for the i-th shard from disk.

Parameters:

i (int) – Shard index for shard to retrieve labels from.

Returns:

A numpy array of labels for i-th shard.

Return type:

np.ndarray

get_shard_w(i: int) ndarray[source]

Retrieves the weights for the i-th shard from disk.

Parameters:

i (int) – Shard index for shard to retrieve weights from.

Returns:

A numpy array of weights for i-th shard.

Return type:

np.ndarray

add_shard(X: ndarray, y: ndarray | None = None, w: ndarray | None = None, ids: ndarray | None = None) None[source]

Adds a data shard.

Parameters:
  • X (np.ndarray) – Feature array.

  • y (np.ndarray, optional (default None)) – Labels array.

  • w (np.ndarray, optional (default None)) – Weights array.

  • ids (np.ndarray, optional (default None)) – Identifiers array.

set_shard(shard_num: int, X: ndarray, y: ndarray | None = None, w: ndarray | None = None, ids: ndarray | None = None) None[source]

Writes data shard to disk.

Parameters:
  • shard_num (int) – Shard index for shard to set new data.

  • X (np.ndarray) – Feature array.

  • y (np.ndarray, optional (default None)) – Labels array.

  • w (np.ndarray, optional (default None)) – Weights array.

  • ids (np.ndarray, optional (default None)) – Identifiers array.
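
A brief sketch of adding and overwriting shards (the shapes of the new shard data match the existing shard):

>>> import numpy as np
>>> import deepchem as dc
>>> X, y = np.random.rand(10, 4), np.random.rand(10, 1)
>>> dataset = dc.data.DiskDataset.from_numpy(X, y)
>>> dataset.add_shard(X=np.random.rand(10, 4), y=np.random.rand(10, 1),
...                   w=np.ones((10, 1)), ids=np.arange(10, 20))
>>> dataset.set_shard(0, X=np.zeros((10, 4)), y=np.zeros((10, 1)),
...                   w=np.ones((10, 1)), ids=np.arange(10))
>>> dataset.get_number_shards()
2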

select(indices: Sequence[int] | ndarray, select_dir: str | None = None, select_shard_size: int | None = None, output_numpy_dataset: bool | None = False) Dataset[source]

Creates a new dataset from a selection of indices from self.

Examples

>>> import numpy as np
>>> X = np.random.rand(10, 10)
>>> dataset = dc.data.DiskDataset.from_numpy(X)
>>> selected = dataset.select([1, 3, 4])
>>> len(selected)
3
Parameters:
  • indices (Sequence) – List of indices to select.

  • select_dir (str, optional (default None)) – Path to new directory that the selected indices will be copied to.

  • select_shard_size (Optional[int], (default None)) – If specified, the shard size to use for the output selected DiskDataset. If output_numpy_dataset is False and this is not manually specified, it defaults to the current dataset’s shard size.

  • output_numpy_dataset (Optional[bool], (default False)) – If True, output an in-memory NumpyDataset instead of a DiskDataset. Note that select_dir and select_shard_size must be None if this is True.

Returns:

A dataset containing the selected samples. The default dataset is DiskDataset. If output_numpy_dataset is True, the dataset is NumpyDataset.

Return type:

Dataset

property ids: ndarray[source]

Get the ids vector for this dataset as a single numpy array.

property X: ndarray[source]

Get the X vector for this dataset as a single numpy array.

property y: ndarray[source]

Get the y vector for this dataset as a single numpy array.

property w: ndarray[source]

Get the weight vector for this dataset as a single numpy array.

property memory_cache_size: int[source]

Get the size of the memory cache for this dataset, measured in bytes.

__len__() int[source]

Finds number of elements in dataset.

get_shape() Tuple[Tuple[int, ...], Tuple[int, ...], Tuple[int, ...], Tuple[int, ...]][source]

Finds shape of dataset.

Returns four tuples, giving the shape of the X, y, w, and ids arrays.

get_label_means() DataFrame[source]

Return pandas series of label means.

get_label_stds() DataFrame[source]

Return pandas series of label stds.

static from_dataframe(df: DataFrame, X: str | Sequence[str] | None = None, y: str | Sequence[str] | None = None, w: str | Sequence[str] | None = None, ids: str | None = None)[source]

Construct a Dataset from the contents of a pandas DataFrame.

Parameters:
  • df (pd.DataFrame) – The pandas DataFrame

  • X (str or List[str], optional (default None)) – The name of the column or columns containing the X array. If this is None, it will look for default column names that match those produced by to_dataframe().

  • y (str or List[str], optional (default None)) – The name of the column or columns containing the y array. If this is None, it will look for default column names that match those produced by to_dataframe().

  • w (str or List[str], optional (default None)) – The name of the column or columns containing the w array. If this is None, it will look for default column names that match those produced by to_dataframe().

  • ids (str, optional (default None)) – The name of the column containing the ids. If this is None, it will look for default column names that match those produced by to_dataframe().

get_statistics(X_stats: bool = True, y_stats: bool = True) Tuple[ndarray, ...][source]

Compute and return statistics of this dataset.

Uses self.itersamples() to compute means and standard deviations of the dataset. Can compute on large datasets that don’t fit in memory.

Parameters:
  • X_stats (bool, optional (default True)) – If True, compute feature-level mean and standard deviations.

  • y_stats (bool, optional (default True)) – If True, compute label-level mean and standard deviations.

Returns:

  • If X_stats == True, returns (X_means, X_stds).

  • If y_stats == True, returns (y_means, y_stds).

  • If both are true, returns (X_means, X_stds, y_means, y_stds).

Return type:

Tuple

make_tf_dataset(batch_size: int = 100, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False)[source]

Create a tf.data.Dataset that iterates over the data in this Dataset.

Each value returned by the Dataset’s iterator is a tuple of (X, y, w) for one batch.

Parameters:
  • batch_size (int, default 100) – The number of samples to include in each batch.

  • epochs (int, default 1) – The number of times to iterate over the Dataset.

  • deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.

  • pad_batches (bool, default False) – If True, batches are padded as necessary to make the size of each batch exactly equal batch_size.

Returns:

TensorFlow Dataset that iterates over the same data.

Return type:

tf.data.Dataset

Note

This class requires TensorFlow to be installed.

to_csv(path: str) None[source]

Write object to a comma-separated values (CSV) file.

Example

>>> import numpy as np
>>> import deepchem as dc
>>> X = np.random.rand(10, 10)
>>> dataset = dc.data.DiskDataset.from_numpy(X)
>>> dataset.to_csv('out.csv')

Parameters:

path (str) – File path or object

Return type:

None

to_dataframe() DataFrame[source]

Construct a pandas DataFrame containing the data from this Dataset.

Returns:

Pandas dataframe. If there is only a single feature per datapoint, will have column “X” else will have columns “X1,X2,…” for features. If there is only a single label per datapoint, will have column “y” else will have columns “y1,y2,…” for labels. If there is only a single weight per datapoint will have column “w” else will have columns “w1,w2,…”. Will have column “ids” for identifiers.

Return type:

pd.DataFrame

ImageDataset

The dc.data.ImageDataset class is optimized to allow for convenient processing of image-based datasets.

class ImageDataset(X: ndarray | List[str], y: ndarray | List[str] | None, w: ArrayLike | None = None, ids: ArrayLike | None = None)[source]

A Dataset that loads data from image files on disk.

__init__(X: ndarray | List[str], y: ndarray | List[str] | None, w: ArrayLike | None = None, ids: ArrayLike | None = None) None[source]

Create a dataset whose X and/or y array is defined by image files on disk.

Parameters:
  • X (np.ndarray or List[str]) – The dataset’s input data. This may be either a single NumPy array directly containing the data, or a list containing the paths to the image files

  • y (np.ndarray or List[str]) – The dataset’s labels. This may be either a single NumPy array directly containing the data, or a list containing the paths to the image files

  • w (np.ndarray, optional (default None)) – a 1D or 2D array containing the weights for each sample or sample/task pair

  • ids (np.ndarray, optional (default None)) – the sample IDs
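
A minimal in-memory sketch, holding X directly as an array rather than as image file paths:

>>> import numpy as np
>>> import deepchem as dc
>>> X = np.random.rand(10, 32, 32)   # ten 32x32 "images" held in memory
>>> y = np.random.rand(10)
>>> dataset = dc.data.ImageDataset(X, y)
>>> len(dataset)
10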

__len__() int[source]

Get the number of elements in the dataset.

get_shape() Tuple[Tuple[int, ...], Tuple[int, ...], Tuple[int, ...], Tuple[int, ...]][source]

Get the shape of the dataset.

Returns four tuples, giving the shape of the X, y, w, and ids arrays.

get_task_names() ndarray[source]

Get the names of the tasks associated with this dataset.

property X: ndarray[source]

Get the X vector for this dataset as a single numpy array.

property y: ndarray[source]

Get the y vector for this dataset as a single numpy array.

property ids: ndarray[source]

Get the ids vector for this dataset as a single numpy array.

property w: ndarray[source]

Get the weight vector for this dataset as a single numpy array.

iterbatches(batch_size: int | None = None, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False) Iterator[Tuple[ndarray, ndarray, ndarray, ndarray]][source]

Get an object that iterates over minibatches from the dataset.

Each minibatch is returned as a tuple of four numpy arrays: (X, y, w, ids).

Parameters:
  • batch_size (int, optional (default None)) – Number of elements in each batch.

  • epochs (int, default 1) – Number of epochs to walk over dataset.

  • deterministic (bool, default False) – If True, follow deterministic order.

  • pad_batches (bool, default False) – If True, pad each batch to batch_size.

Returns:

Generator which yields tuples of four numpy arrays (X, y, w, ids).

Return type:

Iterator[Batch]

itersamples() Iterator[Tuple[ndarray, ndarray, ndarray, ndarray]][source]

Get an object that iterates over the samples in the dataset.

Returns:

Iterator which yields tuples of four numpy arrays (X, y, w, ids).

Return type:

Iterator[Batch]

transform(transformer: Transformer, **args) NumpyDataset[source]

Construct a new dataset by applying a transformation to every sample in this dataset.

The argument is a function that can be called as follows:

>> newx, newy, neww = fn(x, y, w)

It might be called only once with the whole dataset, or multiple times with different subsets of the data. Each time it is called, it should transform the samples and return the transformed data.

Parameters:

transformer (dc.trans.Transformer) – The transformation to apply to each sample in the dataset

Returns:

A newly constructed NumpyDataset object

Return type:

NumpyDataset

select(indices: Sequence[int] | ndarray, select_dir: str | None = None) ImageDataset[source]

Creates a new dataset from a selection of indices from self.

Parameters:
  • indices (Sequence) – List of indices to select.

  • select_dir (str, optional (default None)) – Used to provide same API as DiskDataset. Ignored since ImageDataset is purely in-memory.

Returns:

A selected ImageDataset object

Return type:

ImageDataset

make_pytorch_dataset(epochs: int = 1, deterministic: bool = False, batch_size: int | None = None)[source]

Create a torch.utils.data.IterableDataset that iterates over the data in this Dataset.

Each value returned by the Dataset’s iterator is a tuple of (X, y, w, id) containing the data for one batch, or for a single sample if batch_size is None.

Parameters:
  • epochs (int, default 1) – The number of times to iterate over the Dataset.

  • deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.

  • batch_size (int, optional (default None)) – The number of samples to return in each batch. If None, each returned value is a single sample.

Returns:

torch.utils.data.IterableDataset that iterates over the data in this dataset.

Return type:

torch.utils.data.IterableDataset

Note

This method requires PyTorch to be installed.

static from_dataframe(df: DataFrame, X: str | Sequence[str] | None = None, y: str | Sequence[str] | None = None, w: str | Sequence[str] | None = None, ids: str | None = None)[source]

Construct a Dataset from the contents of a pandas DataFrame.

Parameters:
  • df (pd.DataFrame) – The pandas DataFrame

  • X (str or List[str], optional (default None)) – The name of the column or columns containing the X array. If this is None, it will look for default column names that match those produced by to_dataframe().

  • y (str or List[str], optional (default None)) – The name of the column or columns containing the y array. If this is None, it will look for default column names that match those produced by to_dataframe().

  • w (str or List[str], optional (default None)) – The name of the column or columns containing the w array. If this is None, it will look for default column names that match those produced by to_dataframe().

  • ids (str, optional (default None)) – The name of the column containing the ids. If this is None, it will look for default column names that match those produced by to_dataframe().

get_statistics(X_stats: bool = True, y_stats: bool = True) Tuple[ndarray, ...][source]

Compute and return statistics of this dataset.

Uses self.itersamples() to compute means and standard deviations of the dataset. Can compute on large datasets that don’t fit in memory.

Parameters:
  • X_stats (bool, optional (default True)) – If True, compute feature-level mean and standard deviations.

  • y_stats (bool, optional (default True)) – If True, compute label-level mean and standard deviations.

Returns:

  • If X_stats == True, returns (X_means, X_stds).

  • If y_stats == True, returns (y_means, y_stds).

  • If both are true, returns (X_means, X_stds, y_means, y_stds).

Return type:

Tuple

make_tf_dataset(batch_size: int = 100, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False)[source]

Create a tf.data.Dataset that iterates over the data in this Dataset.

Each value returned by the Dataset’s iterator is a tuple of (X, y, w) for one batch.

Parameters:
  • batch_size (int, default 100) – The number of samples to include in each batch.

  • epochs (int, default 1) – The number of times to iterate over the Dataset.

  • deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.

  • pad_batches (bool, default False) – If True, batches are padded as necessary to make the size of each batch exactly equal batch_size.

Returns:

TensorFlow Dataset that iterates over the same data.

Return type:

tf.data.Dataset

Note

This class requires TensorFlow to be installed.

to_csv(path: str) None[source]

Write object to a comma-separated values (CSV) file.

Example

>>> import numpy as np
>>> import deepchem as dc
>>> X = np.random.rand(10, 10)
>>> dataset = dc.data.DiskDataset.from_numpy(X)
>>> dataset.to_csv('out.csv')

Parameters:

path (str) – File path or object

Return type:

None

to_dataframe() DataFrame[source]

Construct a pandas DataFrame containing the data from this Dataset.

Returns:

Pandas dataframe. If there is only a single feature per datapoint, will have column “X” else will have columns “X1,X2,…” for features. If there is only a single label per datapoint, will have column “y” else will have columns “y1,y2,…” for labels. If there is only a single weight per datapoint will have column “w” else will have columns “w1,w2,…”. Will have column “ids” for identifiers.

Return type:

pd.DataFrame

Data Loaders

Processing large amounts of input data to construct a dc.data.Dataset object can require some amount of hacking. To simplify this process, you can use the dc.data.DataLoader classes, which provide utilities to load and process large amounts of data.

CSVLoader

class CSVLoader(tasks: List[str], featurizer: Featurizer, feature_field: str | None = None, id_field: str | None = None, smiles_field: str | None = None, log_every_n: int = 1000)[source]

Creates Dataset objects from input CSV files.

This class provides conveniences to load data from CSV files. It’s possible to directly featurize data from CSV files using pandas, but this class may prove useful if you’re processing large CSV files that you don’t want to manipulate directly in memory. Note that samples which cannot be featurized are filtered out in the creation of final dataset.

Examples

Let’s suppose we have some smiles and labels

>>> smiles = ["C", "CCC"]
>>> labels = [1.5, 2.3]

Let’s put these in a dataframe.

>>> import pandas as pd
>>> df = pd.DataFrame(list(zip(smiles, labels)), columns=["smiles", "task1"])

Let’s now write this to disk somewhere. We can now use CSVLoader to process this CSV dataset.

>>> import tempfile
>>> import deepchem as dc
>>> with dc.utils.UniversalNamedTemporaryFile(mode='w') as tmpfile:
...     df.to_csv(tmpfile.name)
...     loader = dc.data.CSVLoader(["task1"], feature_field="smiles",
...                              featurizer=dc.feat.CircularFingerprint())
...     dataset = loader.create_dataset(tmpfile.name)
>>> len(dataset)
2

Of course in practice you should already have your data in a CSV file if you’re using CSVLoader. If your data is already in memory, use InMemoryLoader instead.

Sometimes there will be datasets without specific tasks, for example datasets which are used in unsupervised learning tasks. Such datasets can be loaded by leaving the tasks field empty.

Example

>>> x1, x2 = [2, 3, 4], [4, 6, 8]
>>> df = pd.DataFrame({"x1":x1, "x2": x2}).reset_index()
>>> with dc.utils.UniversalNamedTemporaryFile(mode='w') as tmpfile:
...     df.to_csv(tmpfile.name)
...     loader = dc.data.CSVLoader(tasks=[], id_field="index", feature_field=["x1", "x2"],
...                              featurizer=dc.feat.DummyFeaturizer())
...     dataset = loader.create_dataset(tmpfile.name)
>>> len(dataset)
3
__init__(tasks: List[str], featurizer: Featurizer, feature_field: str | None = None, id_field: str | None = None, smiles_field: str | None = None, log_every_n: int = 1000)[source]

Initializes CSVLoader.

Parameters:
  • tasks (List[str]) – List of task names

  • featurizer (Featurizer) – Featurizer to use to process data.

  • feature_field (str, optional (default None)) – Field with data to be featurized.

  • id_field (str, optional, (default None)) – CSV column that holds sample identifier

  • smiles_field (str, optional (default None) (DEPRECATED)) – Name of field that holds smiles string.

  • log_every_n (int, optional (default 1000)) – Writes a logging statement this often.

create_dataset(inputs: Any | Sequence[Any], data_dir: str | None = None, shard_size: int | None = 8192) Dataset[source]

Creates and returns a Dataset object by featurizing provided files.

Reads in inputs and uses self.featurizer to featurize the data in these inputs. For large files, automatically shards into smaller chunks of shard_size datapoints for convenience. Returns a Dataset object that contains the featurized dataset.

This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.

Parameters:
  • inputs (List) – List of inputs to process. Entries can be filenames or arbitrary objects.

  • data_dir (str, optional (default None)) – Directory to store featurized dataset.

  • shard_size (int, optional (default 8192)) – Number of examples stored in each shard.

Returns:

A DiskDataset object containing a featurized representation of data from inputs.

Return type:

DiskDataset

UserCSVLoader

class UserCSVLoader(tasks: List[str], featurizer: Featurizer, feature_field: str | None = None, id_field: str | None = None, smiles_field: str | None = None, log_every_n: int = 1000)[source]

Handles loading of CSV files with user-defined features.

This is a convenience class that allows for descriptors already present in a CSV file to be extracted without any featurization necessary.

Examples

Let’s suppose we have some descriptors and labels. (Imagine that these descriptors have been computed by an external program.)

>>> desc1 = [1, 43]
>>> desc2 = [-2, -22]
>>> labels = [1.5, 2.3]
>>> ids = ["cp1", "cp2"]

Let’s put these in a dataframe.

>>> import pandas as pd
>>> df = pd.DataFrame(list(zip(ids, desc1, desc2, labels)), columns=["id", "desc1", "desc2", "task1"])

Let’s now write this to disk somewhere. We can now use UserCSVLoader to process this CSV dataset.

>>> import tempfile
>>> import deepchem as dc
>>> featurizer = dc.feat.UserDefinedFeaturizer(["desc1", "desc2"])
>>> with dc.utils.UniversalNamedTemporaryFile(mode='w') as tmpfile:
...     df.to_csv(tmpfile.name)
...     loader = dc.data.UserCSVLoader(["task1"], id_field="id",
...                              featurizer=featurizer)
...     dataset = loader.create_dataset(tmpfile.name)
>>> len(dataset)
2
>>> dataset.X[0, 0]
1

The difference between UserCSVLoader and CSVLoader is that our descriptors (our features) have already been computed for us, but are spread across multiple columns of the CSV file.

Of course in practice you should already have your data in a CSV file if you’re using UserCSVLoader. If your data is already in memory, use InMemoryLoader instead.

__init__(tasks: List[str], featurizer: Featurizer, feature_field: str | None = None, id_field: str | None = None, smiles_field: str | None = None, log_every_n: int = 1000)[source]

Initializes CSVLoader.

Parameters:
  • tasks (List[str]) – List of task names

  • featurizer (Featurizer) – Featurizer to use to process data.

  • feature_field (str, optional (default None)) – Field with data to be featurized.

  • id_field (str, optional, (default None)) – CSV column that holds sample identifier

  • smiles_field (str, optional (default None) (DEPRECATED)) – Name of field that holds smiles string.

  • log_every_n (int, optional (default 1000)) – Writes a logging statement this often.

create_dataset(inputs: Any | Sequence[Any], data_dir: str | None = None, shard_size: int | None = 8192) Dataset[source]

Creates and returns a Dataset object by featurizing provided files.

Reads in inputs and uses self.featurizer to featurize the data in these inputs. For large files, automatically shards into smaller chunks of shard_size datapoints for convenience. Returns a Dataset object that contains the featurized dataset.

This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.

Parameters:
  • inputs (List) – List of inputs to process. Entries can be filenames or arbitrary objects.

  • data_dir (str, optional (default None)) – Directory to store featurized dataset.

  • shard_size (int, optional (default 8192)) – Number of examples stored in each shard.

Returns:

A DiskDataset object containing a featurized representation of data from inputs.

Return type:

DiskDataset

ImageLoader

class ImageLoader(tasks: List[str] | None = None, sorting: bool = True)[source]

Creates Dataset objects from input image files.

This class allows for loading of images in various formats. For user convenience, also accepts zip-files and directories of images and uses some limited intelligence to attempt to traverse subdirectories which contain images.

Currently, only .png and .tif files are supported. If the inputs or labels are given as a list of files, the list must contain only image files.

Examples

For this example, we will be using the BBBC001 Dataset. This dataset contains 6 images of human HT29 colon cancer cells. We will use the images as inputs and we will assign the labels as integers ranging from 1 to 6 for the sake of simplicity.

To learn more about this dataset, please visit: https://data.broadinstitute.org/bbbc/BBBC001/ and also see our loader for this dataset: deepchem.molnet.loadbbbc001.

Let’s begin by importing the necessary modules and downloading the dataset.

>>> import os
>>> import numpy as np
>>> import deepchem as dc
>>> data_dir = dc.utils.data_utils.get_data_dir()
>>> dataset_file = os.path.join(data_dir, "BBBC001_v1_images_tif.zip")
>>> BBBC1_IMAGE_URL = 'https://data.broadinstitute.org/bbbc/BBBC001/BBBC001_v1_images_tif.zip'
>>> if not os.path.exists(dataset_file):
...   dc.utils.data_utils.download_url(url=BBBC1_IMAGE_URL, dest_dir=data_dir)

Now that we have the dataset, let’s create a list of labels for each image.

>>> labels = np.array([1,2,3,4,5,6])

We can now use ImageLoader to process this image dataset. We do not apply any real featurization here, hence the UserDefinedFeaturizer with an empty list.

>>> featurizer = dc.feat.UserDefinedFeaturizer([])
>>> loader = dc.data.ImageLoader(tasks=['demo-task'], sorting=False)
>>> dataset = loader.create_dataset(inputs=(dataset_file, labels),
...                                 in_memory=False)

We can confirm that we have 6 images in our dataset and 6 labels. The images are of size 512x512 while the labels are just integers.

>>> len(dataset)
6
>>> dataset.X.shape
(6, 512, 512)
>>> dataset.y.shape
(6,)

The label files can also be images similar to the inputs, in which case we can provide a list of label files instead of a list of labels.

To show this, we will use the input data as the ground truths, this is often seen when making autoencoders. Similar to the above example, let’s use ImageLoader to process this Image dataset.

>>> featurizer = dc.feat.UserDefinedFeaturizer([])
>>> loader = dc.data.ImageLoader(tasks=['demo-task'], sorting=False)
>>> dataset = loader.create_dataset(inputs=(dataset_file, dataset_file),
...                                 in_memory=False)

We can confirm that we have 6 images in our dataset and 6 labels. The images are of size 512x512 while the labels are also images of size 512x512.

>>> len(dataset)
6
>>> dataset.X.shape
(6, 512, 512)
>>> dataset.y.shape
(6, 512, 512)
__init__(tasks: List[str] | None = None, sorting: bool = True)[source]

Initialize image loader.

At present, custom image featurizers aren’t supported by this loader class.

Parameters:
  • tasks (List[str], optional (default None)) – List of task names for image labels.

  • sorting (bool, optional (default True)) – Whether to sort image files by filename.

create_dataset(inputs: str | Sequence[str] | Tuple[Any] | Tuple[str, Any], data_dir: str | None = None, shard_size: int | None = 8192, in_memory: bool = False) Dataset[source]

Creates and returns a Dataset object by featurizing provided image files and labels/weights.

Parameters:
  • inputs (Union[OneOrMany[str], Tuple[Any]]) –

    The inputs provided should be one of the following

    • filename

    • list of filenames

    • Tuple (list of filenames, labels)

    • Tuple (list of filenames, list of label filenames)

    • Tuple (list of filenames, labels, weights)

    • Tuple (list of filenames, list of label filenames, weights)

    Each file in a given list of filenames should either be of a supported image format (.png, .tif only for now) or of a compressed folder of image files (only .zip for now). If labels or weights are provided, they must correspond to the sorted order of all filenames provided, with one label/weight per file. Labels can be filenames too, in which case the labels are loaded as images.

  • data_dir (str, optional (default None)) – Directory to store featurized dataset.

  • shard_size (int, optional (default 8192)) – Shard size when loading data.

  • in_memory (bool, optional (default False)) – If true, return in-memory NumpyDataset. Else return ImageDataset.

Returns:

  • if in_memory == False, the return value is ImageDataset.

  • if in_memory == True and data_dir is None, the return value is NumpyDataset.

  • if in_memory == True and data_dir is not None, the return value is DiskDataset.

Return type:

ImageDataset or NumpyDataset or DiskDataset
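
As an illustrative sketch only (the image filenames below are hypothetical), loading a list of image files together with integer labels using the tuple form described above looks like this:

>>> import deepchem as dc
>>> loader = dc.data.ImageLoader(tasks=['demo-task'], sorting=False)
>>> image_files = ['cell_1.png', 'cell_2.png']  # hypothetical image files
>>> labels = [0, 1]
>>> dataset = loader.create_dataset(inputs=(image_files, labels),
...                                 in_memory=False)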

JsonLoader

JSON is a flexible file format that is human-readable, lightweight, and more compact than other open standard formats like XML. JSON files are similar to python dictionaries of key-value pairs. All keys must be strings, but values can be any of (string, number, object, array, boolean, or null), so the format is more flexible than CSV. JSON is used for describing structured data and to serialize objects. It is conveniently used to read/write Pandas dataframes with the pandas.read_json and pandas.DataFrame.to_json methods.

class JsonLoader(tasks: List[str], feature_field: str, featurizer: Featurizer, label_field: str | None = None, weight_field: str | None = None, id_field: str | None = None, log_every_n: int = 1000)[source]

Creates Dataset objects from input json files.

This class provides conveniences to load data from json files. It’s possible to directly featurize data from json files using pandas, but this class may prove useful if you’re processing large json files that you don’t want to manipulate directly in memory.

It is meant to load JSON files formatted as “records” in line-delimited format, which allows for sharding: a list like [{column -> value}, ... , {column -> value}].

Examples

Let’s create the sample dataframe.

>>> composition = ["LiCoO2", "MnO2"]
>>> labels = [1.5, 2.3]
>>> import pandas as pd
>>> df = pd.DataFrame(list(zip(composition, labels)), columns=["composition", "task"])

Dump the dataframe to the JSON file formatted as “records” in line delimited format and load the json file by JsonLoader.

>>> import tempfile
>>> import deepchem as dc
>>> with dc.utils.UniversalNamedTemporaryFile(mode='w') as tmpfile:
...     df.to_json(tmpfile.name, orient='records', lines=True)
...     featurizer = dc.feat.ElementPropertyFingerprint()
...     loader = dc.data.JsonLoader(["task"], feature_field="composition", featurizer=featurizer)
...     dataset = loader.create_dataset(tmpfile.name)
>>> len(dataset)
2
__init__(tasks: List[str], feature_field: str, featurizer: Featurizer, label_field: str | None = None, weight_field: str | None = None, id_field: str | None = None, log_every_n: int = 1000)[source]

Initializes JsonLoader.

Parameters:
  • tasks (List[str]) – List of task names

  • feature_field (str) – JSON field with data to be featurized.

  • featurizer (Featurizer) – Featurizer to use to process data

  • label_field (str, optional (default None)) – Field with target variables.

  • weight_field (str, optional (default None)) – Field with weights.

  • id_field (str, optional (default None)) – Field for identifying samples.

  • log_every_n (int, optional (default 1000)) – Writes a logging statement this often.

create_dataset(input_files: str | Sequence[str], data_dir: str | None = None, shard_size: int | None = 8192) DiskDataset[source]

Creates a Dataset from input JSON files.

Parameters:
  • input_files (OneOrMany[str]) – List of JSON filenames.

  • data_dir (Optional[str], default None) – Name of directory where featurized data is stored.

  • shard_size (int, optional (default 8192)) – Shard size when loading data.

Returns:

A DiskDataset object containing a featurized representation of data from input_files.

Return type:

DiskDataset

SDFLoader

class SDFLoader(tasks: List[str], featurizer: Featurizer, sanitize: bool = False, log_every_n: int = 1000)[source]

Creates a Dataset object from SDF input files.

This class provides conveniences to load and featurize data from Structure Data Files (SDFs). SDF is a standard format for structural information (3D coordinates of atoms and bonds) of molecular compounds.

Examples

>>> import deepchem as dc
>>> import os
>>> current_dir = os.path.dirname(os.path.realpath(__file__))
>>> featurizer = dc.feat.CircularFingerprint(size=16)
>>> loader = dc.data.SDFLoader(["LogP(RRCK)"], featurizer=featurizer, sanitize=True)
>>> dataset = loader.create_dataset(os.path.join(current_dir, "tests", "membrane_permeability.sdf")) 
>>> len(dataset)
2
__init__(tasks: List[str], featurizer: Featurizer, sanitize: bool = False, log_every_n: int = 1000)[source]

Initialize SDF Loader

Parameters:
  • tasks (list[str]) – List of tasknames. These will be loaded from the SDF file.

  • featurizer (Featurizer) – Featurizer to use to process data

  • sanitize (bool, optional (default False)) – Whether to sanitize molecules.

  • log_every_n (int, optional (default 1000)) – Writes a logging statement this often.

create_dataset(inputs: Any | Sequence[Any], data_dir: str | None = None, shard_size: int | None = 8192) Dataset[source]

Creates and returns a Dataset object by featurizing provided sdf files.

Parameters:
  • inputs (List) – List of inputs to process. Entries can be filenames or arbitrary objects. Each file should be of a supported format (.sdf) or a compressed folder of .sdf files.

  • data_dir (str, optional (default None)) – Directory to store featurized dataset.

  • shard_size (int, optional (default 8192)) – Number of examples stored in each shard.

Returns:

A DiskDataset object containing a featurized representation of data from inputs.

Return type:

DiskDataset

FASTALoader

class FASTALoader(featurizer: Featurizer | None = None, auto_add_annotations: bool = False, legacy: bool = True)[source]

Handles loading of FASTA files.

FASTA files are commonly used to hold sequence data. This class provides convenience methods to load FASTA data and one-hot encode the genomic sequences for use in downstream learning tasks.
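
For illustration only, a minimal sketch of typical usage, assuming a hypothetical sequences.fasta file on disk and the default one-hot featurization with legacy=False:

>>> import deepchem as dc
>>> loader = dc.data.FASTALoader(legacy=False)
>>> dataset = loader.create_dataset('sequences.fasta')  # hypothetical FASTA file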

__init__(featurizer: Featurizer | None = None, auto_add_annotations: bool = False, legacy: bool = True)[source]

Initialize FASTALoader.

Parameters:
  • featurizer (Featurizer (default: None)) –

    The Featurizer to be used for the loaded FASTA data.

    If featurizer is None and legacy is True, the original featurization logic is used, creating a one hot encoding of all included FASTA strings of shape (number of FASTA sequences, number of channels + 1, sequence length, 1).

    If featurizer is None and legacy is False, the featurizer is initialized as a OneHotFeaturizer object with charset (“A”, “C”, “T”, “G”) and max_length = None.

  • auto_add_annotations (bool (default False)) – Whether create_dataset will automatically add [CLS] and [SEP] annotations to the sequences it reads in order to assist tokenization. Keep False if your FASTA file already includes [CLS] and [SEP] annotations.

  • legacy (bool (default True)) –

    Whether to use legacy logic for featurization. Legacy mode will create a one hot encoding of the FASTA content of shape (number of FASTA sequences, number of channels + 1, max length, 1).

    Legacy mode is only tested for ACTGN charsets, and will be deprecated.

create_dataset(input_files: str | Sequence[str], data_dir: str | None = None, shard_size: int | None = None) DiskDataset[source]

Creates a Dataset from input FASTA files.

At present, FASTA support is limited and doesn’t allow for sharding.

Parameters:
  • input_files (List[str]) – List of fasta files.

  • data_dir (str, optional (default None)) – Name of directory where featurized data is stored.

  • shard_size (int, optional (default None)) – For now, this argument is ignored and each FASTA file gets its own shard.

Returns:

A DiskDataset object containing a featurized representation of data from input_files.

Return type:

DiskDataset

FASTQLoader

InMemoryLoader

The dc.data.InMemoryLoader is designed to facilitate the processing of large datasets where you already hold the raw data in-memory (say in a pandas dataframe).

class InMemoryLoader(tasks: List[str], featurizer: Featurizer, id_field: str | None = None, log_every_n: int = 1000)[source]

Facilitate Featurization of In-memory objects.

When featurizing a dataset, it’s often the case that the initial set of data (pre-featurization) fits handily within memory. (For example, perhaps it fits within a column of a pandas DataFrame.) In this case, it would be convenient to directly be able to featurize this column of data. However, the process of featurization often generates large arrays which quickly eat up available memory. This class provides convenient capabilities to process such in-memory data by checkpointing generated features periodically to disk.

Example

Here’s an example with only datapoints and no labels or weights.

>>> import deepchem as dc
>>> smiles = ["C", "CC", "CCC", "CCCC"]
>>> featurizer = dc.feat.CircularFingerprint()
>>> loader = dc.data.InMemoryLoader(tasks=["task1"], featurizer=featurizer)
>>> dataset = loader.create_dataset(smiles, shard_size=2)
>>> len(dataset)
4

Here’s an example with both datapoints and labels

>>> import deepchem as dc
>>> smiles = ["C", "CC", "CCC", "CCCC"]
>>> labels = [1, 0, 1, 0]
>>> featurizer = dc.feat.CircularFingerprint()
>>> loader = dc.data.InMemoryLoader(tasks=["task1"], featurizer=featurizer)
>>> dataset = loader.create_dataset(zip(smiles, labels), shard_size=2)
>>> len(dataset)
4

Here’s an example with datapoints, labels, weights and ids all provided.

>>> import deepchem as dc
>>> smiles = ["C", "CC", "CCC", "CCCC"]
>>> labels = [1, 0, 1, 0]
>>> weights = [1.5, 0, 1.5, 0]
>>> ids = ["C", "CC", "CCC", "CCCC"]
>>> featurizer = dc.feat.CircularFingerprint()
>>> loader = dc.data.InMemoryLoader(tasks=["task1"], featurizer=featurizer)
>>> dataset = loader.create_dataset(zip(smiles, labels, weights, ids), shard_size=2)
>>> len(dataset)
4
create_dataset(inputs: Sequence[Any], data_dir: str | None = None, shard_size: int | None = 8192) DiskDataset[source]

Creates and returns a Dataset object by featurizing the provided in-memory inputs.

Reads in inputs and uses self.featurizer to featurize the data in these inputs. For large input sequences, automatically shards into smaller chunks of shard_size datapoints for convenience. Returns a Dataset object that contains the featurized dataset.

This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.

Parameters:
  • inputs (Sequence[Any]) – List of inputs to process. Entries can be arbitrary objects so long as they are understood by self.featurizer

  • data_dir (str, optional (default None)) – Directory to store featurized dataset.

  • shard_size (int, optional (default 8192)) – Number of examples stored in each shard.

Returns:

A DiskDataset object containing a featurized representation of data from inputs.

Return type:

DiskDataset

__init__(tasks: List[str], featurizer: Featurizer, id_field: str | None = None, log_every_n: int = 1000)[source]

Construct a DataLoader object.

This constructor is provided as a template mainly. You shouldn’t ever call this constructor directly as a user.

Parameters:
  • tasks (List[str]) – List of task names

  • featurizer (Featurizer) – Featurizer to use to process data.

  • id_field (str, optional (default None)) – Name of field that holds sample identifier. Note that the meaning of “field” depends on the input data type and can have a different meaning in different subclasses. For example, a CSV file could have a field as a column, and an SDF file could have a field as molecular property.

  • log_every_n (int, optional (default 1000)) – Writes a logging statement this often.

Density Functional Theory YAML Loader

class DFTYamlLoader[source]

Creates a Dataset object from YAML input files.

This class provides methods to load and featurize data from a YAML file. In this class, we focus only on a specific input format that can be used to perform Density Functional Theory calculations.

Examples

>>> from deepchem.data.data_loader import DFTYamlLoader
>>> import deepchem as dc
>>> import pytest
>>> inputs = 'deepchem/data/tests/dftdata.yaml'
>>> data = DFTYamlLoader()
>>> output = data.create_dataset(inputs)

Notes

Format (and example) for the YAML file:

  • e_type : 'ae'
    true_val : '0.09194410469'
    systems : [{'moldesc': 'Li 1.5070 0 0; H -1.5070 0 0', 'basis': '6-311++G(3df,3pd)'}]

Each entry in the YAML file must contain the three parameters: e_type, true_val and systems, in this particular order. One entry object may contain one or more systems. This data class does not support or require an additional featurizer, since the datapoints are featurized within its methods. To read more about the parameters and their possible values, please refer to deepchem.feat.dft_data.

__init__()[source]

Initialize DFTYAML loader

create_dataset(inputs: Any | Sequence[Any], data_dir: str | None = None, shard_size: int | None = 1) Dataset[source]

Creates and returns a Dataset object by featurizing provided YAML files.

Parameters:
  • inputs (OneOrMany[str]) – List of YAML filenames.

  • data_dir (Optional[str], default None) – Name of directory where featurized data is stored.

  • shard_size (int, optional (default 1)) – Shard size when loading data.

Returns:

A DiskDataset object containing a featurized representation of data from inputs.

Return type:

DiskDataset

SAM Loader

class SAMLoader(featurizer: Featurizer | None = None)[source]

Handles loading of SAM files. Sequence Alignment Map (SAM) is a text-based format used for storing biological sequences aligned to a reference sequence. It is generally used for storing nucleotide sequences generated by next-generation sequencing technologies, as well as unmapped sequences. SAM files have a header section and an alignment section. Alignment sections have 11 mandatory fields, as well as a variable number of optional fields. Here, we extract the Query Name, Query Sequence, Query Length, Reference Name, Reference Start, CIGAR and Mapping Quality of each read in the SAM file. This class provides methods to load and featurize data from SAM files.

Examples

>>> from deepchem.data.data_loader import SAMLoader
>>> import deepchem as dc
>>> import pytest
>>> inputs = 'deepchem/data/tests/example.sam'
>>> data = SAMLoader()
>>> output = data.create_dataset(inputs)

Note

This class requires pysam to be installed. Pysam can be used on Linux or macOS. To use Pysam on Windows, use the Windows Subsystem for Linux (WSL).

__init__(featurizer: Featurizer | None = None)[source]

Initialize SAMLoader.

Parameters:

featurizer (Featurizer (default: None)) – The Featurizer to be used for the loaded SAM data.

create_dataset(input_files: str | Sequence[str], data_dir: str | None = None, shard_size: int | None = None) DiskDataset[source]

Creates a Dataset from input SAM files.

Parameters:
  • input_files (List[str]) – List of SAM files.

  • data_dir (str, optional (default None)) – Name of directory where featurized data is stored.

  • shard_size (int, optional (default None)) – For now, this argument is ignored and each SAM file gets its own shard.

Returns:

A DiskDataset object containing a featurized representation of data from input_files.

Return type:

DiskDataset

Data Classes

DeepChem featurizers often transform datapoints into “data classes”. These are classes that hold all the information needed to train a model on that datapoint. Models then transform these into the tensors used for training in their default_generator methods.

Graph Data

These classes document the data classes for graph convolutions. We plan to simplify these classes (ConvMol, MultiConvMol, WeaveMol) into a joint data representation (GraphData) for all graph convolutions in a future version of DeepChem, so these APIs may not remain stable.

The graph convolution models which inherit KerasModel depend on ConvMol, MultiConvMol, or WeaveMol. On the other hand, the graph convolution models which inherit TorchModel depend on GraphData.

class ConvMol(atom_features, adj_list, max_deg=10, min_deg=0)[source]

Holds information about a molecule.

Resorts order of atoms internally to be in order of increasing degree. Note that only heavy atoms (hydrogens excluded) are considered here.
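
For illustration only, here is a minimal sketch of constructing a ConvMol directly from an atom feature matrix and an adjacency list (the features below are random and the 4-atom chain is purely hypothetical):

>>> import numpy as np
>>> from deepchem.feat.mol_graphs import ConvMol
>>> atom_features = np.random.rand(4, 3)   # 4 atoms, 3 features each
>>> adj_list = [[1], [0, 2], [1, 3], [2]]  # a simple 4-atom chain
>>> mol = ConvMol(atom_features, adj_list)
>>> mol.get_atom_features().shape
(4, 3)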

__init__(atom_features, adj_list, max_deg=10, min_deg=0)[source]
Parameters:
  • atom_features (np.ndarray) – Has shape (n_atoms, n_feat)

  • adj_list (list) – List of length n_atoms, with neighbor indices of each atom.

  • max_deg (int, optional) – Maximum degree of any atom.

  • min_deg (int, optional) – Minimum degree of any atom.

get_atoms_with_deg(deg)[source]

Retrieves atom_features with the specific degree

get_num_atoms_with_deg(deg)[source]

Returns the number of atoms with the given degree

get_atom_features()[source]

Returns canonicalized version of atom features.

Features are sorted by atom degree, with the original order maintained when degrees are the same.

get_adjacency_list()[source]

Returns a canonicalized adjacency list.

Canonicalized means that the atoms are re-ordered by degree.

Returns:

Canonicalized form of adjacency list.

Return type:

list

get_deg_adjacency_lists()[source]

Returns adjacency lists grouped by atom degree.

Returns:

Has length (max_deg+1-min_deg). The element at position deg is itself a list of the neighbor-lists for atoms with degree deg.

Return type:

list

get_deg_slice()[source]

Returns degree-slice tensor.

The deg_slice tensor allows indexing into a flattened version of the molecule’s atoms. Assume atoms are sorted in order of degree. Then deg_slice[deg][0] is the starting position for atoms of degree deg in flattened list, and deg_slice[deg][1] is the number of atoms with degree deg.

Note deg_slice has shape (max_deg+1-min_deg, 2).

Returns:

deg_slice – Shape (max_deg+1-min_deg, 2)

Return type:

np.ndarray

static get_null_mol(n_feat, max_deg=10, min_deg=0)[source]

Constructs a null molecule.

Get one molecule with one atom of each degree, with all the atoms connected to themselves, and containing n_feat features.

Parameters:

n_feat (int) – number of features for the nodes in the null molecule

static agglomerate_mols(mols, max_deg=10, min_deg=0)[source]
Concatenates a list of ConvMol objects into one mol object that can be used to feed into tensorflow placeholders. The indexing of the molecules is preserved during the combination, but the indexing of the atoms is greatly changed.

Parameters:

mols (list) – ConvMol objects to be combined into one molecule.

class MultiConvMol(nodes, deg_adj_lists, deg_slice, membership, num_mols)[source]

Holds information about multiple molecules, for use in feeding information into tensorflow. Generated using the agglomerate_mols function
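
As an illustrative sketch only (random features, tiny hypothetical molecules), a MultiConvMol is typically produced like this:

>>> import numpy as np
>>> from deepchem.feat.mol_graphs import ConvMol
>>> mol1 = ConvMol(np.random.rand(3, 4), [[1], [0, 2], [1]])
>>> mol2 = ConvMol(np.random.rand(2, 4), [[1], [0]])
>>> multi = ConvMol.agglomerate_mols([mol1, mol2])  # returns a MultiConvMol
>>> multi.get_num_molecules()
2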

__init__(nodes, deg_adj_lists, deg_slice, membership, num_mols)[source]
get_deg_adjacency_lists()[source]
get_atom_features()[source]
get_num_atoms()[source]
get_num_molecules()[source]
__module__ = 'deepchem.feat.mol_graphs'[source]
class WeaveMol(nodes, pairs, pair_edges)[source]

Molecular featurization object for weave convolutions.

These objects are produced by WeaveFeaturizer, and feed into WeaveModel. The underlying implementation is inspired by [1].

References

Kearnes, Steven, et al. “Molecular graph convolutions: moving beyond fingerprints.” Journal of Computer-Aided Molecular Design 30.8 (2016): 595-608.

__init__(nodes, pairs, pair_edges)[source]
get_pair_edges()[source]
get_pair_features()[source]
get_atom_features()[source]
get_num_atoms()[source]
get_num_features()[source]
__module__ = 'deepchem.feat.mol_graphs'[source]
class GraphData(node_features: ndarray, edge_index: ndarray, edge_features: ndarray | None = None, node_pos_features: ndarray | None = None, **kwargs)[source]

GraphData class

This data class is almost the same as torch_geometric.data.Data.

node_features[source]

Node feature matrix with shape [num_nodes, num_node_features]

Type:

np.ndarray

edge_index[source]

Graph connectivity in COO format with shape [2, num_edges]

Type:

np.ndarray, dtype int

edge_features[source]

Edge feature matrix with shape [num_edges, num_edge_features]

Type:

np.ndarray, optional (default None)

node_pos_features[source]

Node position matrix with shape [num_nodes, num_dimensions].

Type:

np.ndarray, optional (default None)

num_nodes[source]

The number of nodes in the graph

Type:

int

num_node_features[source]

The number of features per node in the graph

Type:

int

num_edges[source]

The number of edges in the graph

Type:

int

num_edges_features[source]

The number of features per edge in the graph

Type:

int, optional (default None)

Examples

>>> import numpy as np
>>> from deepchem.feat.graph_data import GraphData
>>> node_features = np.random.rand(5, 10)
>>> edge_index = np.array([[0, 1, 2, 3, 4], [1, 2, 3, 4, 0]], dtype=np.int64)
>>> edge_features = np.random.rand(5, 5)
>>> global_features = np.random.random(5)
>>> graph = GraphData(node_features, edge_index, edge_features, z=global_features)
>>> graph
GraphData(node_features=[5, 10], edge_index=[2, 5], edge_features=[5, 5], z=[5])
__init__(node_features: ndarray, edge_index: ndarray, edge_features: ndarray | None = None, node_pos_features: ndarray | None = None, **kwargs)[source]
Parameters:
  • node_features (np.ndarray) – Node feature matrix with shape [num_nodes, num_node_features]

  • edge_index (np.ndarray, dtype int) – Graph connectivity in COO format with shape [2, num_edges]

  • edge_features (np.ndarray, optional (default None)) – Edge feature matrix with shape [num_edges, num_edge_features]

  • node_pos_features (np.ndarray, optional (default None)) – Node position matrix with shape [num_nodes, num_dimensions].

  • kwargs (optional) – Additional attributes and their values

to_pyg_graph()[source]

Convert to PyTorch Geometric graph data instance

Returns:

Graph data for PyTorch Geometric

Return type:

torch_geometric.data.Data

Note

This method requires PyTorch Geometric to be installed.

to_dgl_graph(self_loop: bool = False)[source]

Convert to DGL graph data instance

Parameters:

self_loop (bool) – Whether to add self loops for the nodes, i.e. edges from nodes to themselves. Default to False.

Returns:

Graph data for DGL

Return type:

dgl.DGLGraph

Note

This method requires DGL to be installed.
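
A minimal sketch of both conversion helpers, assuming the optional torch_geometric and dgl packages are installed (the graph below is random):

>>> import numpy as np
>>> from deepchem.feat.graph_data import GraphData
>>> node_features = np.random.rand(5, 10)
>>> edge_index = np.array([[0, 1, 2, 3, 4], [1, 2, 3, 4, 0]], dtype=np.int64)
>>> graph = GraphData(node_features, edge_index)
>>> pyg_graph = graph.to_pyg_graph()                 # torch_geometric.data.Data
>>> dgl_graph = graph.to_dgl_graph(self_loop=True)   # dgl.DGLGraph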

numpy_to_torch(device: str = 'cpu')[source]

Convert numpy arrays to torch tensors. This may be useful when you are using PyTorch Geometric with GraphData objects.

Parameters:

device (str) – Device to store the tensors. Default to ‘cpu’.

Example

>>> import numpy as np
>>> from deepchem.feat.graph_data import GraphData
>>> num_nodes, num_node_features = 5, 32
>>> num_edges, num_edge_features = 6, 32
>>> node_features = np.random.random_sample((num_nodes, num_node_features))
>>> edge_features = np.random.random_sample((num_edges, num_edge_features))
>>> edge_index = np.random.randint(0, num_nodes, (2, num_edges))
>>> graph_data = GraphData(node_features, edge_index, edge_features)
>>> graph_data = graph_data.numpy_to_torch()
>>> print(type(graph_data.node_features))
<class 'torch.Tensor'>
subgraph(nodes)[source]

Returns the subgraph induced on the given node indices.

Parameters:

nodes (list, iterable) – A list of node indices to be included in the subgraph.

Returns:

subgraph_data – A new GraphData object containing the subgraph induced on nodes, together with a mapping from the original node indices to the node indices of the subgraph (see the example below).

Return type:

GraphData

Example

>>> import numpy as np
>>> from deepchem.feat.graph_data import GraphData
>>> node_features = np.random.rand(5, 10)
>>> edge_index = np.array([[0, 1, 2, 3, 4], [1, 2, 3, 4, 0]], dtype=np.int64)
>>> edge_features = np.random.rand(5, 3)
>>> graph_data = GraphData(node_features, edge_index, edge_features)
>>> nodes = [0, 2, 4]
>>> subgraph_data, node_mapping = graph_data.subgraph(nodes)

Density Functional Theory Data

These Data classes are used to create entry objects for DFT calculations.

class DFTSystem(system: Dict)[source]

The DFTSystem class creates and returns the various systems in an entry object as dictionaries.

Examples

>>> from deepchem.feat.dft_data import DFTSystem
>>> systems = {'moldesc': 'Li 1.5070 0 0; H -1.5070 0 0','basis': '6-311++G(3df,3pd)'}
>>> output = DFTSystem(systems)
Return type:

DFTSystem object for all the individual atoms/ions/molecules in an entry object.

References

Kasim, Muhammad F., and Sam M. Vinko. “Learning the exchange-correlation functional from nature with fully differentiable density functional theory.” Physical Review Letters 127.12 (2021): 126403.

https://github.com/diffqc/dqc/blob/0fe821fc92cb3457fb14f6dff0c223641c514ddb/dqc/system/base_system.py

__init__(system: Dict)[source]
get_dqc_mol(pos_reqgrad: bool = False) BaseSystem[source]

This method converts the system dictionary to a DQC system and returns it.

Parameters:

pos_reqgrad (bool) – decides if the atomic positions require gradient calculation.

Returns:

DQC mol object

Return type:

mol

class DFTEntry(e_type: str, true_val: str | None, systems: List[Dict], weight: int | None = 1)[source]

Handles creating and initialising DFTEntry objects from the dataset. This object contains information about the various systems in the datapoint (atoms, molecules and ions) along with the ground truth values.

Notes

The DFTEntry class should not be initialized directly, but created through DFTEntry.create.

Example

>>> from deepchem.feat.dft_data import DFTEntry
>>> e_type= 'dm'
>>> true_val= 'deepchem/data/tests/dftHF_output.npy'
>>> systems = [{'moldesc': 'H 0.86625 0 0; F -0.86625 0 0','basis': '6-311++G(3df,3pd)'}]
>>> dm_entry_for_HF = DFTEntry.create(e_type, true_val, systems)
classmethod create(e_type: str, true_val: str | None, systems: List[Dict], weight: int | None = 1)[source]

This method is used to initialise the DFTEntry class. The entry objects are created based on their entry type.

Parameters:
  • e_type (str) – Determines the type of calculation to be carried out on the entry object. Accepts the following values: “ae”, “ie”, “dm”, “dens”, that stand for atomization energy, ionization energy, density matrix and density profile respectively.

  • true_val (str) – Ground state energy values for the entry object as a string (for ae and ie), or a .npy file containing a matrix ( for dm and dens).

  • systems (List[Dict]) – List of dictionaries containing the “moldesc”, “basis” and “spin” of all the atoms/molecules. These values are to be entered in the DQC or PySCF format. The systems need to be entered in a specific order, i.e. the main atom/molecule needs to be the first element. (This is for objects containing equations, such as ae and ie entry objects.) Spin and charge of the system are optional parameters and are considered ‘0’ if not specified. The system number refers to the number of times the system is present in the molecule - this is for polyatomic molecules and the default value is 1. For example, the system number of hydrogen in water is 2.

  • weight (int) – Weight of the entry object.

Return type:

DFTEntry object based on entry type

__init__(e_type: str, true_val: str | None, systems: List[Dict], weight: int | None = 1)[source]
get_systems() List[DFTSystem][source]
Parameters:

systems (List[DFTSystem]) –

Return type:

List of systems in the entry

abstract property entry_type: str[source]

Returns:

The type of entry:

  • Atomic Ionization Potential (IP/IE)

  • Atomization Energy (AE)

  • Density Profile (DENS)

  • Density Matrix (DM)

get_true_val() ndarray[source]

Get the true value of the DFTEntry. For the AE and IP entry types, the experimental values are collected from the NIST CCCBDB/ASD databases. The true values of density profiles are calculated using PYSCF-CCSD calculations. This method simply loads the value, no calculation is performed.

abstract get_val(qcs: List[KSCalc]) ndarray[source]

Return the energy value of the entry, using a DQC-DFT calculation, where the XC has been replaced by the trained neural network. This method does not carry out any calculations, it is an interface to the KSCalc utility.

get_weight()[source]
Return type:

Weight of the entry object

Base Classes (for developers)

Dataset

The dc.data.Dataset class is the abstract parent class for all datasets. This class should never be directly initialized, but contains a number of useful method implementations.

class Dataset[source]

Abstract base class for datasets defined by X, y, w elements.

Dataset objects are used to store representations of a dataset as used in a machine learning task. Datasets contain features X, labels y, weights w and identifiers ids. Different subclasses of Dataset may choose to hold X, y, w, ids in memory or on disk.

The Dataset class attempts to provide for strong interoperability with other machine learning representations for datasets. Interconversion methods allow for Dataset objects to be converted to and from numpy arrays, pandas dataframes, tensorflow datasets, and pytorch datasets (only to and not from for pytorch at present).

Note that you can never instantiate a Dataset object directly. Instead you will need to instantiate one of the concrete subclasses.

__init__() None[source]
__len__() int[source]

Get the number of elements in the dataset.

Returns:

The number of elements in the dataset.

Return type:

int

get_shape() Tuple[Tuple[int, ...], Tuple[int, ...], Tuple[int, ...], Tuple[int, ...]][source]

Get the shape of the dataset.

Returns four tuples, giving the shape of the X, y, w, and ids arrays.

Returns:

The tuple contains four elements, which are the shapes of the X, y, w, and ids arrays.

Return type:

Tuple

get_task_names() ndarray[source]

Get the names of the tasks associated with this dataset.

property X: ndarray[source]

Get the X vector for this dataset as a single numpy array.

Returns:

A numpy array of features X.

Return type:

np.ndarray

Note

If data is stored on disk, accessing this field may involve loading data from disk and could potentially be slow. Using iterbatches() or itersamples() may be more efficient for larger datasets.

property y: ndarray[source]

Get the y vector for this dataset as a single numpy array.

Returns:

A numpy array of labels y.

Return type:

np.ndarray

Note

If data is stored on disk, accessing this field may involve loading data from disk and could potentially be slow. Using iterbatches() or itersamples() may be more efficient for larger datasets.

property ids: ndarray[source]

Get the ids vector for this dataset as a single numpy array.

Returns:

A numpy array of identifiers ids.

Return type:

np.ndarray

Note

If data is stored on disk, accessing this field may involve loading data from disk and could potentially be slow. Using iterbatches() or itersamples() may be more efficient for larger datasets.

property w: ndarray[source]

Get the weight vector for this dataset as a single numpy array.

Returns:

A numpy array of weights w.

Return type:

np.ndarray

Note

If data is stored on disk, accessing this field may involve loading data from disk and could potentially be slow. Using iterbatches() or itersamples() may be more efficient for larger datasets.

iterbatches(batch_size: int | None = None, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False) Iterator[Tuple[ndarray, ndarray, ndarray, ndarray]][source]

Get an object that iterates over minibatches from the dataset.

Each minibatch is returned as a tuple of four numpy arrays: (X, y, w, ids).

Parameters:
  • batch_size (int, optional (default None)) – Number of elements in each batch.

  • epochs (int, optional (default 1)) – Number of epochs to walk over dataset.

  • deterministic (bool, optional (default False)) – If True, follow deterministic order.

  • pad_batches (bool, optional (default False)) – If True, pad each batch to batch_size.

Returns:

Generator which yields tuples of four numpy arrays (X, y, w, ids).

Return type:

Iterator[Batch]
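
For example, iterating over a small in-memory dataset in padded minibatches (a sketch; the shapes below assume 10 random samples with 3 features each):

>>> import numpy as np
>>> dataset = NumpyDataset(np.random.rand(10, 3), np.random.rand(10, 1))
>>> for X_b, y_b, w_b, ids_b in dataset.iterbatches(batch_size=4, pad_batches=True):
...     print(X_b.shape)
(4, 3)
(4, 3)
(4, 3)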

itersamples() Iterator[Tuple[ndarray, ndarray, ndarray, ndarray]][source]

Get an object that iterates over the samples in the dataset.

Examples

>>> dataset = NumpyDataset(np.ones((2,2)))
>>> for x, y, w, id in dataset.itersamples():
...   print(x.tolist(), y.tolist(), w.tolist(), id)
[1.0, 1.0] [0.0] [0.0] 0
[1.0, 1.0] [0.0] [0.0] 1
transform(transformer: Transformer, **args) Dataset[source]

Construct a new dataset by applying a transformation to every sample in this dataset.

The transformer is applied to each sample; conceptually it behaves like a function that can be called as follows: newx, newy, neww = fn(x, y, w)

It might be called only once with the whole dataset, or multiple times with different subsets of the data. Each time it is called, it should transform the samples and return the transformed data.

Parameters:

transformer (dc.trans.Transformer) – The transformation to apply to each sample in the dataset.

Returns:

A newly constructed Dataset object.

Return type:

Dataset
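
For instance, normalizing the features of an in-memory dataset with a NormalizationTransformer (a sketch using random data):

>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(np.random.rand(5, 3), np.random.rand(5, 1))
>>> transformer = dc.trans.NormalizationTransformer(transform_X=True, dataset=dataset)
>>> normalized = dataset.transform(transformer)
>>> len(normalized)
5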

select(indices: Sequence[int] | ndarray, select_dir: str | None = None) Dataset[source]

Creates a new dataset from a selection of indices from self.

Parameters:
  • indices (Sequence) – List of indices to select.

  • select_dir (str, optional (default None)) – Path to new directory that the selected indices will be copied to.
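
For example, selecting every other sample from a small in-memory dataset (a sketch on random data):

>>> import numpy as np
>>> dataset = NumpyDataset(np.random.rand(5, 3))
>>> selected = dataset.select([0, 2, 4])
>>> len(selected)
3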

get_statistics(X_stats: bool = True, y_stats: bool = True) Tuple[ndarray, ...][source]

Compute and return statistics of this dataset.

Uses self.itersamples() to compute means and standard deviations of the dataset. Can compute on large datasets that don’t fit in memory.

Parameters:
  • X_stats (bool, optional (default True)) – If True, compute feature-level mean and standard deviations.

  • y_stats (bool, optional (default True)) – If True, compute label-level mean and standard deviations.

Returns:

  • If X_stats == True and y_stats == False, returns (X_means, X_stds).

  • If X_stats == False and y_stats == True, returns (y_means, y_stds).

  • If both are True, returns (X_means, X_stds, y_means, y_stds).

Return type:

Tuple
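
For example, computing only the feature-level statistics (a sketch on random data):

>>> import numpy as np
>>> dataset = NumpyDataset(np.random.rand(10, 3), np.random.rand(10, 1))
>>> X_means, X_stds = dataset.get_statistics(X_stats=True, y_stats=False)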

make_tf_dataset(batch_size: int = 100, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False)[source]

Create a tf.data.Dataset that iterates over the data in this Dataset.

Each value returned by the Dataset’s iterator is a tuple of (X, y, w) for one batch.

Parameters:
  • batch_size (int, default 100) – The number of samples to include in each batch.

  • epochs (int, default 1) – The number of times to iterate over the Dataset.

  • deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.

  • pad_batches (bool, default False) – If True, batches are padded as necessary to make the size of each batch exactly equal batch_size.

Returns:

TensorFlow Dataset that iterates over the same data.

Return type:

tf.data.Dataset

Note

This class requires TensorFlow to be installed.
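
A sketch of typical usage (this requires TensorFlow to be installed, so the snippet is illustrative only):

>>> import numpy as np
>>> dataset = NumpyDataset(np.random.rand(10, 3), np.random.rand(10, 1))
>>> tf_dataset = dataset.make_tf_dataset(batch_size=5, epochs=1, deterministic=True)
>>> for X_b, y_b, w_b in tf_dataset:
...     pass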

make_pytorch_dataset(epochs: int = 1, deterministic: bool = False, batch_size: int | None = None)[source]

Create a torch.utils.data.IterableDataset that iterates over the data in this Dataset.

Each value returned by the Dataset’s iterator is a tuple of (X, y, w, id) containing the data for one batch, or for a single sample if batch_size is None.

Parameters:
  • epochs (int, default 1) – The number of times to iterate over the Dataset.

  • deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.

  • batch_size (int, optional (default None)) – The number of samples to return in each batch. If None, each returned value is a single sample.

Returns:

torch.utils.data.IterableDataset that iterates over the data in this dataset.

Return type:

torch.utils.data.IterableDataset

Note

This class requires PyTorch to be installed.
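
A sketch of typical usage (this requires PyTorch to be installed, so the snippet is illustrative only):

>>> import numpy as np
>>> dataset = NumpyDataset(np.random.rand(10, 3), np.random.rand(10, 1))
>>> torch_dataset = dataset.make_pytorch_dataset(epochs=1, batch_size=5)
>>> for X_b, y_b, w_b, ids_b in torch_dataset:
...     pass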

to_dataframe() DataFrame[source]

Construct a pandas DataFrame containing the data from this Dataset.

Returns:

Pandas dataframe. If there is only a single feature per datapoint, will have column “X” else will have columns “X1,X2,…” for features. If there is only a single label per datapoint, will have column “y” else will have columns “y1,y2,…” for labels. If there is only a single weight per datapoint will have column “w” else will have columns “w1,w2,…”. Will have column “ids” for identifiers.

Return type:

pd.DataFrame

static from_dataframe(df: DataFrame, X: str | Sequence[str] | None = None, y: str | Sequence[str] | None = None, w: str | Sequence[str] | None = None, ids: str | None = None)[source]

Construct a Dataset from the contents of a pandas DataFrame.

Parameters:
  • df (pd.DataFrame) – The pandas DataFrame

  • X (str or List[str], optional (default None)) – The name of the column or columns containing the X array. If this is None, it will look for default column names that match those produced by to_dataframe().

  • y (str or List[str], optional (default None)) – The name of the column or columns containing the y array. If this is None, it will look for default column names that match those produced by to_dataframe().

  • w (str or List[str], optional (default None)) – The name of the column or columns containing the w array. If this is None, it will look for default column names that match those produced by to_dataframe().

  • ids (str, optional (default None)) – The name of the column containing the ids. If this is None, it will look for default column names that match those produced by to_dataframe().
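
A round trip between a Dataset and a pandas DataFrame might look like this (a sketch on random data):

>>> import numpy as np
>>> dataset = NumpyDataset(np.random.rand(5, 3), np.random.rand(5, 1))
>>> df = dataset.to_dataframe()
>>> roundtrip = NumpyDataset.from_dataframe(df)
>>> len(roundtrip)
5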

to_csv(path: str) None[source]

Write object to a comma-separated values (CSV) file.

Example

>>> import numpy as np
>>> import deepchem as dc
>>> X = np.random.rand(10, 10)
>>> dataset = dc.data.DiskDataset.from_numpy(X)
>>> dataset.to_csv('out.csv')  
Parameters:

path (str) – File path or object

Return type:

None

DataLoader

The dc.data.DataLoader class is the abstract parent class for all dataloaders. This class should never be directly initialized, but contains a number of useful method implementations.

class DataLoader(tasks: List[str], featurizer: Featurizer, id_field: str | None = None, log_every_n: int = 1000)[source]

Handles loading/featurizing of data from disk.

The main use of DataLoader and its child classes is to make it easier to load large datasets into Dataset objects.

DataLoader is an abstract superclass that provides a general framework for loading data into DeepChem. This class should never be instantiated directly. To load your own type of data, make a subclass of DataLoader and provide your own implementation for the create_dataset() method.

To construct a Dataset from input data, first instantiate a concrete data loader (that is, an object which is an instance of a subclass of DataLoader) with a given Featurizer object. Then call the data loader’s create_dataset() method on a list of input files that hold the source data to process. Note that each subclass of DataLoader is specialized to handle one type of input data so you will have to pick the loader class suitable for your input data type.

Note that it isn’t necessary to use a data loader to process input data. You can directly use Featurizer objects to featurize provided input into numpy arrays, but note that this calculation will be performed in memory, so you will have to write generators that walk the source files and write featurized data to disk yourself. DataLoader and its subclasses make this process easier for you by performing this work under the hood.
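
For example, with a concrete subclass such as CSVLoader, the pattern just described looks roughly like this; the CSV path below is hypothetical:

>>> import deepchem as dc
>>> featurizer = dc.feat.CircularFingerprint(size=1024)
>>> loader = dc.data.CSVLoader(tasks=["task1"], feature_field="smiles",
...                            featurizer=featurizer)
>>> dataset = loader.create_dataset("my_data.csv")  # hypothetical CSV file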

__init__(tasks: List[str], featurizer: Featurizer, id_field: str | None = None, log_every_n: int = 1000)[source]

Construct a DataLoader object.

This constructor is provided as a template mainly. You shouldn’t ever call this constructor directly as a user.

Parameters:
  • tasks (List[str]) – List of task names

  • featurizer (Featurizer) – Featurizer to use to process data.

  • id_field (str, optional (default None)) – Name of field that holds sample identifier. Note that the meaning of “field” depends on the input data type and can have a different meaning in different subclasses. For example, a CSV file could have a field as a column, and an SDF file could have a field as molecular property.

  • log_every_n (int, optional (default 1000)) – Writes a logging statement this often.

featurize(inputs: Any | Sequence[Any], data_dir: str | None = None, shard_size: int | None = 8192) Dataset[source]

Featurize provided files and write to specified location.

DEPRECATED: This method is now a wrapper for create_dataset() and calls that method under the hood.

For large datasets, automatically shards into smaller chunks for convenience. This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.

Parameters:
  • inputs (List) – List of inputs to process. Entries can be filenames or arbitrary objects.

  • data_dir (str, default None) – Directory to store featurized dataset.

  • shard_size (int, optional (default 8192)) – Number of examples stored in each shard.

Returns:

A Dataset object containing a featurized representation of data from inputs.

Return type:

Dataset

create_dataset(inputs: Any | Sequence[Any], data_dir: str | None = None, shard_size: int | None = 8192) Dataset[source]

Creates and returns a Dataset object by featurizing provided files.

Reads in inputs and uses self.featurizer to featurize the data in these inputs. For large files, automatically shards into smaller chunks of shard_size datapoints for convenience. Returns a Dataset object that contains the featurized dataset.

This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.

Parameters:
  • inputs (List) – List of inputs to process. Entries can be filenames or arbitrary objects.

  • data_dir (str, optional (default None)) – Directory to store featurized dataset.

  • shard_size (int, optional (default 8192)) – Number of examples stored in each shard.

Returns:

A DiskDataset object containing a featurized representation of data from inputs.

Return type:

DiskDataset