Datasets

DeepChem dc.data.Dataset objects are one of the core building blocks of DeepChem programs. Dataset objects hold representations of data for machine learning and are widely used throughout DeepChem.

Dataset

The dc.data.Dataset class is the abstract parent class for all datasets. This class should never be instantiated directly, but it provides a number of useful method implementations.

The goal of the Dataset class is to be maximally interoperable with other common representations of machine learning datasets. For this reason we provide interconversion methods mapping from Dataset objects to pandas DataFrames, TensorFlow datasets, and PyTorch datasets.
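For example, a Dataset can be round-tripped through a pandas DataFrame. A minimal sketch (the random arrays here are purely illustrative):

>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.random.rand(4, 3), y=np.random.rand(4, 1))
>>> df = dataset.to_dataframe()
>>> roundtrip = dc.data.NumpyDataset.from_dataframe(df)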

class deepchem.data.Dataset[source]

Abstract base class for datasets defined by X, y, w elements.

Dataset objects are used to store representations of a dataset as used in a machine learning task. Datasets contain features X, labels y, weights w and identifiers ids. Different subclasses of Dataset may choose to hold X, y, w, ids in memory or on disk.

The Dataset class attempts to provide strong interoperability with other machine learning representations of datasets. Interconversion methods allow Dataset objects to be converted to and from numpy arrays, pandas DataFrames, TensorFlow datasets, and PyTorch datasets (for PyTorch, conversion is currently only to, not from).

Note that you can never instantiate a Dataset object directly. Instead you will need to instantiate one of the concrete subclasses.

X

Get the X vector for this dataset as a single numpy array.

Returns:A numpy array of features X.
Return type:np.ndarray

Notes

If data is stored on disk, accessing this field may involve loading data from disk and could potentially be slow. Using iterbatches() or itersamples() may be more efficient for larger datasets.

__init__() → None[source]

Initialize self. See help(type(self)) for accurate signature.

static from_dataframe(df: pandas.core.frame.DataFrame, X: Union[str, Sequence[str], None] = None, y: Union[str, Sequence[str], None] = None, w: Union[str, Sequence[str], None] = None, ids: Optional[str] = None)[source]

Construct a Dataset from the contents of a pandas DataFrame.

Parameters:
  • df (pd.DataFrame) – The pandas DataFrame
  • X (str or List[str], optional (default None)) – The name of the column or columns containing the X array. If this is None, it will look for default column names that match those produced by to_dataframe().
  • y (str or List[str], optional (default None)) – The name of the column or columns containing the y array. If this is None, it will look for default column names that match those produced by to_dataframe().
  • w (str or List[str], optional (default None)) – The name of the column or columns containing the w array. If this is None, it will look for default column names that match those produced by to_dataframe().
  • ids (str, optional (default None)) – The name of the column containing the ids. If this is None, it will look for default column names that match those produced by to_dataframe().
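A minimal sketch of selecting explicit columns with from_dataframe (the column names here are illustrative assumptions):

>>> import pandas as pd
>>> import deepchem as dc
>>> df = pd.DataFrame({'f1': [1.0, 2.0], 'f2': [3.0, 4.0], 'label': [0.0, 1.0]})
>>> dataset = dc.data.NumpyDataset.from_dataframe(df, X=['f1', 'f2'], y='label')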
get_shape() → Tuple[Tuple[int, ...], Tuple[int, ...], Tuple[int, ...], Tuple[int, ...]][source]

Get the shape of the dataset.

Returns four tuples, giving the shape of the X, y, w, and ids arrays.

Returns:The tuple contains four elements, which are the shapes of the X, y, w, and ids arrays.
Return type:Tuple
get_statistics(X_stats: bool = True, y_stats: bool = True) → Tuple[float, ...][source]

Compute and return statistics of this dataset.

Uses self.itersamples() to compute means and standard deviations of the dataset. Can compute on large datasets that don’t fit in memory.

Parameters:
  • X_stats (bool, optional (default True)) – If True, compute feature-level mean and standard deviations.
  • y_stats (bool, optional (default True)) – If True, compute label-level mean and standard deviations.
Returns:

If X_stats == True, returns (X_means, X_stds). If y_stats == True, returns (y_means, y_stds). If both are true, returns (X_means, X_stds, y_means, y_stds).

Return type:

Tuple
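With both flags left True (the default), the return value unpacks into four arrays. A short sketch:

>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.random.rand(10, 3), y=np.random.rand(10, 1))
>>> X_means, X_stds, y_means, y_stds = dataset.get_statistics()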

get_task_names() → numpy.ndarray[source]

Get the names of the tasks associated with this dataset.

ids

Get the ids vector for this dataset as a single numpy array.

Returns:A numpy array of identifiers ids.
Return type:np.ndarray

Notes

If data is stored on disk, accessing this field may involve loading data from disk and could potentially be slow. Using iterbatches() or itersamples() may be more efficient for larger datasets.

iterbatches(batch_size: Optional[int] = None, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False) → Iterator[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]][source]

Get an object that iterates over minibatches from the dataset.

Each minibatch is returned as a tuple of four numpy arrays: (X, y, w, ids).

Parameters:
  • batch_size (int, optional (default None)) – Number of elements in each batch.
  • epochs (int, optional (default 1)) – Number of epochs to walk over dataset.
  • deterministic (bool, optional (default False)) – If True, follow deterministic order.
  • pad_batches (bool, optional (default False)) – If True, pad each batch to batch_size.
Returns:

Generator which yields tuples of four numpy arrays (X, y, w, ids).

Return type:

Iterator[Batch]
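A minimal sketch of a training-style loop over minibatches; with pad_batches=True every batch has exactly batch_size rows:

>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.random.rand(10, 3), y=np.random.rand(10, 1))
>>> for X_b, y_b, w_b, ids_b in dataset.iterbatches(batch_size=4, pad_batches=True):
...   assert X_b.shape[0] == 4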

itersamples() → Iterator[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]][source]

Get an object that iterates over the samples in the dataset.

Examples

>>> import numpy as np
>>> from deepchem.data import NumpyDataset
>>> dataset = NumpyDataset(np.ones((2,2)))
>>> for x, y, w, id in dataset.itersamples():
...   print(x.tolist(), y.tolist(), w.tolist(), id)
[1.0, 1.0] [0.0] [0.0] 0
[1.0, 1.0] [0.0] [0.0] 1
make_pytorch_dataset(epochs: int = 1, deterministic: bool = False, batch_size: Optional[int] = None)[source]

Create a torch.utils.data.IterableDataset that iterates over the data in this Dataset.

Each value returned by the Dataset’s iterator is a tuple of (X, y, w, id) containing the data for one batch, or for a single sample if batch_size is None.

Parameters:
  • epochs (int, default 1) – The number of times to iterate over the Dataset.
  • deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
  • batch_size (int, optional (default None)) – The number of samples to return in each batch. If None, each returned value is a single sample.
Returns:

torch.utils.data.IterableDataset that iterates over the data in this dataset.

Return type:

torch.utils.data.IterableDataset

Notes

This method requires PyTorch to be installed.
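A hedged sketch, assuming PyTorch is installed; the returned IterableDataset can be iterated directly (or wrapped in a torch.utils.data.DataLoader):

>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.random.rand(8, 3), y=np.random.rand(8, 1))
>>> torch_ds = dataset.make_pytorch_dataset(epochs=1, batch_size=4)
>>> for X_b, y_b, w_b, ids_b in torch_ds:
...   pass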

make_tf_dataset(batch_size: int = 100, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False)[source]

Create a tf.data.Dataset that iterates over the data in this Dataset.

Each value returned by the Dataset’s iterator is a tuple of (X, y, w) for one batch.

Parameters:
  • batch_size (int, default 100) – The number of samples to include in each batch.
  • epochs (int, default 1) – The number of times to iterate over the Dataset.
  • deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
  • pad_batches (bool, default False) – If True, batches are padded as necessary to make the size of each batch exactly equal batch_size.
Returns:

TensorFlow Dataset that iterates over the same data.

Return type:

tf.data.Dataset

Notes

This method requires TensorFlow to be installed.
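A hedged sketch, assuming TensorFlow 2.x is installed (the returned tf.data.Dataset yields (X, y, w) tuples when iterated eagerly):

>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.random.rand(8, 3), y=np.random.rand(8, 1))
>>> tf_ds = dataset.make_tf_dataset(batch_size=4, epochs=1)
>>> for X_b, y_b, w_b in tf_ds:
...   pass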

select(indices: Sequence[int], select_dir: Optional[str] = None) → deepchem.data.datasets.Dataset[source]

Creates a new dataset from a selection of indices from self.

Parameters:
  • indices (Sequence) – List of indices to select.
  • select_dir (str, optional (default None)) – Path to new directory that the selected indices will be copied to.
to_dataframe() → pandas.core.frame.DataFrame[source]

Construct a pandas DataFrame containing the data from this Dataset.

Returns:A pandas DataFrame. If there is only a single feature per datapoint, the DataFrame will have a column “X”; otherwise it will have columns “X1,X2,…” for the features. Likewise, labels appear in a column “y” or in columns “y1,y2,…”, weights in a column “w” or in columns “w1,w2,…”, and identifiers in a column “ids”.
Return type:pd.DataFrame
transform(transformer: transformers.Transformer, **args) → deepchem.data.datasets.Dataset[source]

Construct a new dataset by applying a transformation to every sample in this dataset.

The argument is a function that can be called as follows:

>> newx, newy, neww = fn(x, y, w)

It might be called only once with the whole dataset, or multiple times with different subsets of the data. Each time it is called, it should transform the samples and return the transformed data.

Parameters:transformer (dc.trans.Transformer) – The transformation to apply to each sample in the dataset.
Returns:A newly constructed Dataset object.
Return type:Dataset
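A minimal sketch using dc.trans.NormalizationTransformer, assuming the transformer API of current DeepChem releases:

>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.random.rand(10, 3), y=np.random.rand(10, 1))
>>> transformer = dc.trans.NormalizationTransformer(transform_X=True, dataset=dataset)
>>> normalized = dataset.transform(transformer)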
w

Get the weight vector for this dataset as a single numpy array.

Returns:A numpy array of weights w.
Return type:np.ndarray

Notes

If data is stored on disk, accessing this field may involve loading data from disk and could potentially be slow. Using iterbatches() or itersamples() may be more efficient for larger datasets.

y

Get the y vector for this dataset as a single numpy array.

Returns:A numpy array of labels y.
Return type:np.ndarray

Notes

If data is stored on disk, accessing this field may involve loading data from disk and could potentially be slow. Using iterbatches() or itersamples() may be more efficient for larger datasets.

NumpyDataset

The dc.data.NumpyDataset class provides an in-memory implementation of the abstract Dataset which stores its data in numpy.ndarray objects.

class deepchem.data.NumpyDataset(X: numpy.ndarray, y: Optional[numpy.ndarray] = None, w: Optional[numpy.ndarray] = None, ids: Optional[numpy.ndarray] = None, n_tasks: int = 1)[source]

A Dataset defined by in-memory numpy arrays.

This subclass of Dataset stores the arrays X, y, w, and ids in memory as numpy arrays. This makes it very easy to construct NumpyDataset objects.

Examples

>>> import numpy as np
>>> from deepchem.data import NumpyDataset
>>> dataset = NumpyDataset(X=np.random.rand(5, 3), y=np.random.rand(5,), ids=np.arange(5))
X

Get the X vector for this dataset as a single numpy array.

__init__(X: numpy.ndarray, y: Optional[numpy.ndarray] = None, w: Optional[numpy.ndarray] = None, ids: Optional[numpy.ndarray] = None, n_tasks: int = 1) → None[source]

Initialize this object.

Parameters:
  • X (np.ndarray) – Input features. A numpy array of shape (n_samples,…).
  • y (np.ndarray, optional (default None)) – Labels. A numpy array of shape (n_samples, …). Note that each label can have an arbitrary shape.
  • w (np.ndarray, optional (default None)) – Weights. Should either be 1D array of shape (n_samples,) or if there’s more than one task, of shape (n_samples, n_tasks).
  • ids (np.ndarray, optional (default None)) – Identifiers. A numpy array of shape (n_samples,)
  • n_tasks (int, default 1) – Number of learning tasks.
static from_DiskDataset(ds: deepchem.data.datasets.DiskDataset) → deepchem.data.datasets.NumpyDataset[source]

Convert DiskDataset to NumpyDataset.

Parameters:ds (DiskDataset) – DiskDataset to transform to NumpyDataset.
Returns:A new NumpyDataset created from DiskDataset.
Return type:NumpyDataset
static from_dataframe(df: pandas.core.frame.DataFrame, X: Union[str, Sequence[str], None] = None, y: Union[str, Sequence[str], None] = None, w: Union[str, Sequence[str], None] = None, ids: Optional[str] = None)[source]

Construct a Dataset from the contents of a pandas DataFrame.

Parameters:
  • df (pd.DataFrame) – The pandas DataFrame
  • X (str or List[str], optional (default None)) – The name of the column or columns containing the X array. If this is None, it will look for default column names that match those produced by to_dataframe().
  • y (str or List[str], optional (default None)) – The name of the column or columns containing the y array. If this is None, it will look for default column names that match those produced by to_dataframe().
  • w (str or List[str], optional (default None)) – The name of the column or columns containing the w array. If this is None, it will look for default column names that match those produced by to_dataframe().
  • ids (str, optional (default None)) – The name of the column containing the ids. If this is None, it will look for default column names that match those produced by to_dataframe().
static from_json(fname: str) → deepchem.data.datasets.NumpyDataset[source]

Create a NumpyDataset from a json file.

Parameters:fname (str) – The name of the json file.
Returns:A new NumpyDataset created from the json file.
Return type:NumpyDataset
get_shape() → Tuple[Tuple[int, ...], Tuple[int, ...], Tuple[int, ...], Tuple[int, ...]][source]

Get the shape of the dataset.

Returns four tuples, giving the shape of the X, y, w, and ids arrays.

get_statistics(X_stats: bool = True, y_stats: bool = True) → Tuple[float, ...][source]

Compute and return statistics of this dataset.

Uses self.itersamples() to compute means and standard deviations of the dataset. Can compute on large datasets that don’t fit in memory.

Parameters:
  • X_stats (bool, optional (default True)) – If True, compute feature-level mean and standard deviations.
  • y_stats (bool, optional (default True)) – If True, compute label-level mean and standard deviations.
Returns:

If X_stats == True, returns (X_means, X_stds). If y_stats == True, returns (y_means, y_stds). If both are true, returns (X_means, X_stds, y_means, y_stds).

Return type:

Tuple

get_task_names() → numpy.ndarray[source]

Get the names of the tasks associated with this dataset.

ids

Get the ids vector for this dataset as a single numpy array.

iterbatches(batch_size: Optional[int] = None, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False) → Iterator[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]][source]

Get an object that iterates over minibatches from the dataset.

Each minibatch is returned as a tuple of four numpy arrays: (X, y, w, ids).

Parameters:
  • batch_size (int, optional (default None)) – Number of elements in each batch.
  • epochs (int, default 1) – Number of epochs to walk over dataset.
  • deterministic (bool, optional (default False)) – If True, follow deterministic order.
  • pad_batches (bool, optional (default False)) – If True, pad each batch to batch_size.
Returns:

Generator which yields tuples of four numpy arrays (X, y, w, ids).

Return type:

Iterator[Batch]

itersamples() → Iterator[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]][source]

Get an object that iterates over the samples in the dataset.

Returns:Iterator which yields tuples of four numpy arrays (X, y, w, ids).
Return type:Iterator[Batch]

Examples

>>> import numpy as np
>>> from deepchem.data import NumpyDataset
>>> dataset = NumpyDataset(np.ones((2,2)))
>>> for x, y, w, id in dataset.itersamples():
...   print(x.tolist(), y.tolist(), w.tolist(), id)
[1.0, 1.0] [0.0] [0.0] 0
[1.0, 1.0] [0.0] [0.0] 1
make_pytorch_dataset(epochs: int = 1, deterministic: bool = False, batch_size: Optional[int] = None)[source]

Create a torch.utils.data.IterableDataset that iterates over the data in this Dataset.

Each value returned by the Dataset’s iterator is a tuple of (X, y, w, id) containing the data for one batch, or for a single sample if batch_size is None.

Parameters:
  • epochs (int, default 1) – The number of times to iterate over the Dataset.
  • deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
  • batch_size (int, optional (default None)) – The number of samples to return in each batch. If None, each returned value is a single sample.
Returns:

torch.utils.data.IterableDataset that iterates over the data in this dataset.

Return type:

torch.utils.data.IterableDataset

Notes

This method requires PyTorch to be installed.

make_tf_dataset(batch_size: int = 100, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False)[source]

Create a tf.data.Dataset that iterates over the data in this Dataset.

Each value returned by the Dataset’s iterator is a tuple of (X, y, w) for one batch.

Parameters:
  • batch_size (int, default 100) – The number of samples to include in each batch.
  • epochs (int, default 1) – The number of times to iterate over the Dataset.
  • deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
  • pad_batches (bool, default False) – If True, batches are padded as necessary to make the size of each batch exactly equal batch_size.
Returns:

TensorFlow Dataset that iterates over the same data.

Return type:

tf.data.Dataset

Notes

This method requires TensorFlow to be installed.

static merge(datasets: Sequence[deepchem.data.datasets.Dataset]) → deepchem.data.datasets.NumpyDataset[source]

Merge multiple NumpyDatasets.

Parameters:datasets (List[Dataset]) – List of datasets to merge.
Returns:A single NumpyDataset containing all the samples from all datasets.
Return type:NumpyDataset
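A short sketch of merging two in-memory datasets:

>>> import numpy as np
>>> import deepchem as dc
>>> first = dc.data.NumpyDataset(X=np.random.rand(5, 3))
>>> second = dc.data.NumpyDataset(X=np.random.rand(5, 3))
>>> merged = dc.data.NumpyDataset.merge([first, second])
>>> len(merged)
10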
select(indices: Sequence[int], select_dir: Optional[str] = None) → deepchem.data.datasets.NumpyDataset[source]

Creates a new dataset from a selection of indices from self.

Parameters:
  • indices (List[int]) – List of indices to select.
  • select_dir (str, optional (default None)) – Used to provide same API as DiskDataset. Ignored since NumpyDataset is purely in-memory.
Returns:

A selected NumpyDataset object

Return type:

NumpyDataset

to_dataframe() → pandas.core.frame.DataFrame[source]

Construct a pandas DataFrame containing the data from this Dataset.

Returns:A pandas DataFrame. If there is only a single feature per datapoint, the DataFrame will have a column “X”; otherwise it will have columns “X1,X2,…” for the features. Likewise, labels appear in a column “y” or in columns “y1,y2,…”, weights in a column “w” or in columns “w1,w2,…”, and identifiers in a column “ids”.
Return type:pd.DataFrame
to_json(fname: str) → None[source]

Dump this NumpyDataset to a json file.

Parameters:fname (str) – The name of the json file.
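A sketch of a save/load round trip with from_json (the filename is illustrative):

>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.random.rand(5, 3))
>>> dataset.to_json('dataset.json')
>>> restored = dc.data.NumpyDataset.from_json('dataset.json')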
transform(transformer: transformers.Transformer, **args) → deepchem.data.datasets.NumpyDataset[source]

Construct a new dataset by applying a transformation to every sample in this dataset.

The argument is a function that can be called as follows:

>> newx, newy, neww = fn(x, y, w)

It might be called only once with the whole dataset, or multiple times with different subsets of the data. Each time it is called, it should transform the samples and return the transformed data.

Parameters:transformer (dc.trans.Transformer) – The transformation to apply to each sample in the dataset
Returns:A newly constructed NumpyDataset object
Return type:NumpyDataset
w

Get the weight vector for this dataset as a single numpy array.

y

Get the y vector for this dataset as a single numpy array.

DiskDataset

The dc.data.DiskDataset class allows for the storage of larger datasets on disk. Each DiskDataset is associated with a directory in which it writes its contents to disk. Note that a DiskDataset can be very large, so some of the utility methods to access fields of a Dataset can be prohibitively expensive.

class deepchem.data.DiskDataset(data_dir: str)[source]

A Dataset that is stored as a set of files on disk.

The DiskDataset is the workhorse class of DeepChem that facilitates analyses on large datasets. Use this class whenever you’re working with a large dataset that can’t be easily manipulated in RAM.

On disk, a DiskDataset has a simple structure. All files for a given DiskDataset are stored in a data_dir. The contents of data_dir should be laid out as follows:

data_dir/
  —> metadata.csv.gzip
  —> tasks.json
  —> shard-0-X.npy
  —> shard-0-y.npy
  —> shard-0-w.npy
  —> shard-0-ids.npy
  —> shard-1-X.npy
  —> ...

The metadata is constructed by static method DiskDataset._construct_metadata and saved to disk by DiskDataset._save_metadata. The metadata itself consists of a csv file which has columns (‘ids’, ‘X’, ‘y’, ‘w’, ‘ids_shape’, ‘X_shape’, ‘y_shape’, ‘w_shape’). tasks.json consists of a list of task names for this dataset.

The actual data is stored in .npy files (numpy array files) of the form ‘shard-0-X.npy’, ‘shard-0-y.npy’, etc.

The basic structure of DiskDataset is quite robust and will likely serve you well for datasets up to about 100 GB in size. Note, however, that DiskDataset has not been tested on very large datasets in the terabyte range and beyond, so you may be better served by implementing a custom Dataset class for those use cases.

Examples

Let’s walk through a simple example of constructing a new DiskDataset.

>>> import deepchem as dc
>>> import numpy as np
>>> X = np.random.rand(10, 10)
>>> dataset = dc.data.DiskDataset.from_numpy(X)

If you have already saved a DiskDataset to data_dir, you can reinitialize it with

>> data_dir = "/path/to/my/data"
>> dataset = dc.data.DiskDataset(data_dir)

Once you have a dataset you can access its attributes as follows

>>> X = np.random.rand(10, 10)
>>> y = np.random.rand(10,)
>>> w = np.ones_like(y)
>>> dataset = dc.data.DiskDataset.from_numpy(X)
>>> X, y, w = dataset.X, dataset.y, dataset.w

One thing to beware of is that dataset.X, dataset.y, and dataset.w load data from disk! If you have a large dataset, these operations can be extremely slow. Instead, try iterating through the dataset:

>>> for (xi, yi, wi, idi) in dataset.itersamples():
...   pass
data_dir

Location of the directory where this DiskDataset is stored on disk.

Type:str
metadata_df

Pandas DataFrame holding the metadata for this DiskDataset.

Type:pd.DataFrame
legacy_metadata

Whether this DiskDataset uses legacy format.

Type:bool

Notes

DiskDataset originally had a simpler metadata format without shape information. Older DiskDataset objects had metadata files with columns (‘ids’, ‘X’, ‘y’, ‘w’) but without the additional shape columns. DiskDataset maintains backwards compatibility with this older metadata format, but for performance reasons we recommend against using legacy metadata for new projects.

X

Get the X vector for this dataset as a single numpy array.

__init__(data_dir: str) → None[source]

Load a constructed DiskDataset from disk

Note that this method cannot construct a new disk dataset. Instead use static methods DiskDataset.create_dataset or DiskDataset.from_numpy for that purpose. Use this constructor instead to load a DiskDataset that has already been created on disk.

Parameters:data_dir (str) – Location on disk of an existing DiskDataset.
add_shard(X: numpy.ndarray, y: Optional[numpy.ndarray] = None, w: Optional[numpy.ndarray] = None, ids: Optional[numpy.ndarray] = None) → None[source]

Adds a data shard.

Parameters:
  • X (np.ndarray) – Feature array.
  • y (np.ndarray, optional (default None)) – Labels array.
  • w (np.ndarray, optional (default None)) – Weights array.
  • ids (np.ndarray, optional (default None)) – Identifiers array.
complete_shuffle(data_dir: Optional[str] = None) → deepchem.data.datasets.Dataset[source]

Completely shuffle across all data, across all shards.

Notes

The algorithm used for this complete shuffle is O(N^2) where N is the number of shards. It simply constructs each shard of the output dataset one at a time. Since the complete shuffle can take a long time, it’s useful to watch the logging output. Each shuffled shard is constructed using select(), which logs as it selects from each original shard. This will result in O(N^2) logging statements, one for each extraction of shuffled shard i’s contributions from original shard j.

Parameters:data_dir (str, optional (default None)) – Directory to write the shuffled dataset to. If none is specified, a temporary directory will be used.
Returns:A DiskDataset whose data is a randomly shuffled version of this dataset.
Return type:DiskDataset
copy(new_data_dir: str) → deepchem.data.datasets.DiskDataset[source]

Copies dataset to new directory.

Parameters:new_data_dir (str) – The new directory name to copy this dataset to.
Returns:A copied DiskDataset object.
Return type:DiskDataset

Notes

This is a stateful operation! Any data at new_data_dir will be deleted and self.data_dir will be deep copied into new_data_dir.

static create_dataset(shard_generator: Iterable[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]], data_dir: Optional[str] = None, tasks: Optional[Sequence[T_co]] = []) → deepchem.data.datasets.DiskDataset[source]

Creates a new DiskDataset

Parameters:
  • shard_generator (Iterable[Batch]) – An iterable (either a list or generator) that provides tuples of data (X, y, w, ids). Each tuple will be written to a separate shard on disk.
  • data_dir (str, optional (default None)) – Filename for data directory. Creates a temp directory if none specified.
  • tasks (Sequence, optional (default [])) – List of tasks for this dataset.
Returns:

A new DiskDataset constructed from the given data

Return type:

DiskDataset
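
A minimal sketch of a shard generator; each yielded tuple becomes one shard, and a temporary directory is used since data_dir is omitted:

>>> import numpy as np
>>> import deepchem as dc
>>> def shard_generator():
...   for _ in range(3):
...     X = np.random.rand(10, 5)
...     y = np.random.rand(10, 1)
...     w = np.ones((10, 1))
...     ids = np.arange(10)
...     yield X, y, w, ids
>>> dataset = dc.data.DiskDataset.create_dataset(shard_generator())
>>> dataset.get_number_shards()
3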

static from_dataframe(df: pandas.core.frame.DataFrame, X: Union[str, Sequence[str], None] = None, y: Union[str, Sequence[str], None] = None, w: Union[str, Sequence[str], None] = None, ids: Optional[str] = None)[source]

Construct a Dataset from the contents of a pandas DataFrame.

Parameters:
  • df (pd.DataFrame) – The pandas DataFrame
  • X (str or List[str], optional (default None)) – The name of the column or columns containing the X array. If this is None, it will look for default column names that match those produced by to_dataframe().
  • y (str or List[str], optional (default None)) – The name of the column or columns containing the y array. If this is None, it will look for default column names that match those produced by to_dataframe().
  • w (str or List[str], optional (default None)) – The name of the column or columns containing the w array. If this is None, it will look for default column names that match those produced by to_dataframe().
  • ids (str, optional (default None)) – The name of the column containing the ids. If this is None, it will look for default column names that match those produced by to_dataframe().
static from_numpy(X: numpy.ndarray, y: Optional[numpy.ndarray] = None, w: Optional[numpy.ndarray] = None, ids: Optional[numpy.ndarray] = None, tasks: Optional[Sequence[T_co]] = None, data_dir: Optional[str] = None) → deepchem.data.datasets.DiskDataset[source]

Creates a DiskDataset object from specified Numpy arrays.

Parameters:
  • X (np.ndarray) – Feature array.
  • y (np.ndarray, optional (default None)) – Labels array.
  • w (np.ndarray, optional (default None)) – Weights array.
  • ids (np.ndarray, optional (default None)) – Identifiers array.
  • tasks (Sequence, optional (default None)) – Tasks in this dataset
  • data_dir (str, optional (default None)) – The directory to write this dataset to. If none is specified, will use a temporary directory instead.
Returns:

A new DiskDataset constructed from the provided information.

Return type:

DiskDataset

get_data_shape() → Tuple[int, ...][source]

Gets array shape of datapoints in this dataset.

get_label_means() → pandas.core.frame.DataFrame[source]

Return pandas series of label means.

get_label_stds() → pandas.core.frame.DataFrame[source]

Return pandas series of label stds.

get_number_shards() → int[source]

Returns the number of shards for this dataset.

get_shape() → Tuple[Tuple[int, ...], Tuple[int, ...], Tuple[int, ...], Tuple[int, ...]][source]

Finds shape of dataset.

Returns four tuples, giving the shape of the X, y, w, and ids arrays.

get_shard(i: int) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray][source]

Retrieves data for the i-th shard from disk.

Parameters:i (int) – Shard index for shard to retrieve batch from.
Returns:The batch of data for the i-th shard.
Return type:Batch
get_shard_ids(i: int) → numpy.ndarray[source]

Retrieves the list of IDs for the i-th shard from disk.

Parameters:i (int) – Shard index for shard to retrieve ids from.
Returns:A numpy array of ids for i-th shard.
Return type:np.ndarray
get_shard_size() → int[source]

Gets size of shards on disk.

get_shard_w(i: int) → numpy.ndarray[source]

Retrieves the weights for the i-th shard from disk.

Parameters:i (int) – Shard index for shard to retrieve weights from.
Returns:A numpy array of weights for i-th shard.
Return type:np.ndarray
get_shard_y(i: int) → numpy.ndarray[source]

Retrieves the labels for the i-th shard from disk.

Parameters:i (int) – Shard index for shard to retrieve labels from.
Returns:A numpy array of labels for i-th shard.
Return type:np.ndarray
get_statistics(X_stats: bool = True, y_stats: bool = True) → Tuple[float, ...][source]

Compute and return statistics of this dataset.

Uses self.itersamples() to compute means and standard deviations of the dataset. Can compute on large datasets that don’t fit in memory.

Parameters:
  • X_stats (bool, optional (default True)) – If True, compute feature-level mean and standard deviations.
  • y_stats (bool, optional (default True)) – If True, compute label-level mean and standard deviations.
Returns:

If X_stats == True, returns (X_means, X_stds). If y_stats == True, returns (y_means, y_stds). If both are true, returns (X_means, X_stds, y_means, y_stds).

Return type:

Tuple

get_task_names() → numpy.ndarray[source]

Gets learning tasks associated with this dataset.

ids

Get the ids vector for this dataset as a single numpy array.

iterbatches(batch_size: Optional[int] = None, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False) → Iterator[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]][source]

Get an object that iterates over minibatches from the dataset.

It is guaranteed that the number of batches returned is math.ceil(len(dataset)/batch_size). Each minibatch is returned as a tuple of four numpy arrays: (X, y, w, ids).

Parameters:
  • batch_size (int, optional (default None)) – Number of elements in a batch. If None, then it yields batches with size equal to the size of each individual shard.
  • epochs (int, default 1) – Number of epochs to walk over the dataset.
  • deterministic (bool, default False) – If False, the elements within each shard are shuffled before generating batches. Note that this shuffle is only local: elements are never mixed between different shards.
  • pad_batches (bool, default False) – Whether or not we should pad the last batch, globally, such that it has exactly batch_size elements.
Returns:

Generator which yields tuples of four numpy arrays (X, y, w, ids).

Return type:

Iterator[Batch]

itersamples() → Iterator[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]][source]

Get an object that iterates over the samples in the dataset.

Returns:Generator which yields tuples of four numpy arrays (X, y, w, ids).
Return type:Iterator[Batch]

Examples

>>> import numpy as np
>>> from deepchem.data import DiskDataset
>>> dataset = DiskDataset.from_numpy(np.ones((2,2)), np.ones((2,1)))
>>> for x, y, w, id in dataset.itersamples():
...   print(x.tolist(), y.tolist(), w.tolist(), id)
[1.0, 1.0] [1.0] [1.0] 0
[1.0, 1.0] [1.0] [1.0] 1
itershards() → Iterator[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]][source]

Return an object that iterates over all shards in dataset.

Datasets are stored in sharded fashion on disk. Each call to next() for the generator defined by this function returns the data from a particular shard. The order of shards returned is guaranteed to remain fixed.

Returns:Generator which yields tuples of four numpy arrays (X, y, w, ids).
Return type:Iterator[Batch]
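A sketch of shard-wise processing, which never loads more than one shard into memory at a time:

>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.DiskDataset.from_numpy(np.random.rand(100, 10))
>>> total = 0
>>> for X_s, y_s, w_s, ids_s in dataset.itershards():
...   total += X_s.shape[0]
>>> total
100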
load_metadata() → Tuple[List[str], pandas.core.frame.DataFrame][source]

Helper method that loads metadata from disk.

make_pytorch_dataset(epochs: int = 1, deterministic: bool = False, batch_size: Optional[int] = None)[source]

Create a torch.utils.data.IterableDataset that iterates over the data in this Dataset.

Each value returned by the Dataset’s iterator is a tuple of (X, y, w, id) containing the data for one batch, or for a single sample if batch_size is None.

Parameters:
  • epochs (int, default 1) – The number of times to iterate over the Dataset.
  • deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
  • batch_size (int, optional (default None)) – The number of samples to return in each batch. If None, each returned value is a single sample.
Returns:

torch.utils.data.IterableDataset that iterates over the data in this dataset.

Return type:

torch.utils.data.IterableDataset

Notes

This method requires PyTorch to be installed.

make_tf_dataset(batch_size: int = 100, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False)[source]

Create a tf.data.Dataset that iterates over the data in this Dataset.

Each value returned by the Dataset’s iterator is a tuple of (X, y, w) for one batch.

Parameters:
  • batch_size (int, default 100) – The number of samples to include in each batch.
  • epochs (int, default 1) – The number of times to iterate over the Dataset.
  • deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
  • pad_batches (bool, default False) – If True, batches are padded as necessary to make the size of each batch exactly equal batch_size.
Returns:

TensorFlow Dataset that iterates over the same data.

Return type:

tf.data.Dataset

Notes

This method requires TensorFlow to be installed.

memory_cache_size

Get the size of the memory cache for this dataset, measured in bytes.

static merge(datasets: Iterable[Dataset], merge_dir: Optional[str] = None) → deepchem.data.datasets.DiskDataset[source]

Merges provided datasets into a merged dataset.

Parameters:
  • datasets (Iterable[Dataset]) – List of datasets to merge.
  • merge_dir (str, optional (default None)) – The new directory path to store the merged DiskDataset.
Returns:

A merged DiskDataset.

Return type:

DiskDataset

move(new_data_dir: str, delete_if_exists: Optional[bool] = True) → None[source]

Moves dataset to new directory.

Parameters:
  • new_data_dir (str) – The new directory name to move this dataset to.
  • delete_if_exists (bool, optional (default True)) – If this option is set, delete the destination directory if it exists before moving. This is set to True by default to be backwards compatible with behavior in earlier versions of DeepChem.

Notes

This is a stateful operation! self.data_dir will be moved into new_data_dir. If delete_if_exists is set to True (the default), then new_data_dir is deleted first if it is a pre-existing directory.

reshard(shard_size: int) → None[source]

Reshards data to have specified shard size.

Parameters:shard_size (int) – The size of shard.

Examples

>>> import deepchem as dc
>>> import numpy as np
>>> X = np.random.rand(100, 10)
>>> d = dc.data.DiskDataset.from_numpy(X)
>>> d.reshard(shard_size=10)
>>> d.get_number_shards()
10

Notes

If this DiskDataset is in legacy_metadata format, reshard will convert this dataset to have non-legacy metadata.

save_to_disk() → None[source]

Save dataset to disk.

select(indices: Sequence[int], select_dir: Optional[str] = None, select_shard_size: Optional[int] = None, output_numpy_dataset: Optional[bool] = False) → deepchem.data.datasets.Dataset[source]

Creates a new dataset from a selection of indices from self.

Examples

>>> import deepchem as dc
>>> import numpy as np
>>> X = np.random.rand(10, 10)
>>> dataset = dc.data.DiskDataset.from_numpy(X)
>>> selected = dataset.select([1, 3, 4])
>>> len(selected)
3
Parameters:
  • indices (Sequence) – List of indices to select.
  • select_dir (str, optional (default None)) – Path to a new directory that the selected indices will be copied to.
  • select_shard_size (int, optional (default None)) – The shard size to use for the selected output DiskDataset. If not manually specified and output_numpy_dataset is False, this defaults to the current dataset’s shard size.
  • output_numpy_dataset (bool, optional (default False)) – If True, output an in-memory NumpyDataset instead of a DiskDataset. Note that select_dir and select_shard_size must be None if this is True.
Returns:

A dataset containing the selected samples. The default dataset is DiskDataset. If output_numpy_dataset is True, the dataset is NumpyDataset.

Return type:

Dataset

set_shard(shard_num: int, X: numpy.ndarray, y: Optional[numpy.ndarray] = None, w: Optional[numpy.ndarray] = None, ids: Optional[numpy.ndarray] = None) → None[source]

Writes data shard to disk.

Parameters:
  • shard_num (int) – Shard index for shard to set new data.
  • X (np.ndarray) – Feature array.
  • y (np.ndarray, optional (default None)) – Labels array.
  • w (np.ndarray, optional (default None)) – Weights array.
  • ids (np.ndarray, optional (default None)) – Identifiers array.
shuffle_each_shard(shard_basenames: Optional[List[str]] = None) → None[source]

Shuffles elements within each shard of the dataset.

Parameters:shard_basenames (List[str], optional (default None)) – The basenames for each shard. If this isn’t specified, will assume the basenames of form “shard-i” used by create_dataset and reshard.
shuffle_shards() → None[source]

Shuffles the order of the shards for this dataset.

sparse_shuffle() → None[source]

Shuffling that exploits data sparsity to shuffle large datasets.

If feature vectors are sparse, say circular fingerprints or any other representation that contains few nonzero values, it is possible to exploit that sparsity to simplify shuffles. This method implements a sparse shuffle by compressing sparse feature vectors into a compact representation, shuffling this compressed dataset in memory, and writing the results to disk.

Notes

This method only works for 1-dimensional feature vectors (does not work for tensorial featurizations). Note that this shuffle is performed in place.

subset(shard_nums: Sequence[int], subset_dir: Optional[str] = None) → deepchem.data.datasets.DiskDataset[source]

Creates a subset of the original dataset on disk.

Parameters:
  • shard_nums (Sequence[int]) – The indices of shard to extract from the original DiskDataset.
  • subset_dir (str, optional (default None)) – The new directory path to store the subset DiskDataset.
Returns:

A subset DiskDataset.

Return type:

DiskDataset

to_dataframe() → pandas.core.frame.DataFrame[source]

Construct a pandas DataFrame containing the data from this Dataset.

Returns:A pandas DataFrame. If there is only a single feature per datapoint, the DataFrame will have a column “X”; otherwise it will have columns “X1,X2,…” for the features. Likewise, labels appear in a column “y” or in columns “y1,y2,…”, weights in a column “w” or in columns “w1,w2,…”, and identifiers in a column “ids”.
Return type:pd.DataFrame
transform(transformer: transformers.Transformer, parallel: bool = False, out_dir: Optional[str] = None, **args) → deepchem.data.datasets.DiskDataset[source]

Construct a new dataset by applying a transformation to every sample in this dataset.

The argument is a function that can be called as follows:

>> newx, newy, neww = fn(x, y, w)

It might be called only once with the whole dataset, or multiple times with different subsets of the data. Each time it is called, it should transform the samples and return the transformed data.

Parameters:
  • transformer (dc.trans.Transformer) – The transformation to apply to each sample in the dataset.
  • parallel (bool, default False) – If True, use multiple processes to transform the dataset in parallel.
  • out_dir (str, optional (default None)) – The directory to save the new dataset in. If this is omitted, a temporary directory is created automatically.
Returns:

A newly constructed Dataset object

Return type:

DiskDataset

w

Get the weight vector for this dataset as a single numpy array.

static write_data_to_disk(data_dir: str, basename: str, tasks: numpy.ndarray, X: Optional[numpy.ndarray] = None, y: Optional[numpy.ndarray] = None, w: Optional[numpy.ndarray] = None, ids: Optional[numpy.ndarray] = None) → List[Optional[str]][source]

Static helper method to write data to disk.

This helper method is used to write a shard of data to disk.

Parameters:
  • data_dir (str) – Data directory to write shard to.
  • basename (str) – Basename for the shard in question.
  • tasks (np.ndarray) – The names of the tasks in question.
  • X (np.ndarray, optional (default None)) – The features array.
  • y (np.ndarray, optional (default None)) – The labels array.
  • w (np.ndarray, optional (default None)) – The weights array.
  • ids (np.ndarray, optional (default None)) – The identifiers array.
Returns:

A list with values [out_ids, out_X, out_y, out_w, out_ids_shape, out_X_shape, out_y_shape, out_w_shape] giving the filenames of the locations on disk to which the respective arrays were written.

Return type:

List[Optional[str]]

y

Get the y vector for this dataset as a single numpy array.

ImageDataset

The dc.data.ImageDataset class is optimized for convenient processing of image-based datasets.

class deepchem.data.ImageDataset(X: Union[numpy.ndarray, List[str]], y: Union[numpy.ndarray, List[str], None], w: Optional[numpy.ndarray] = None, ids: Optional[numpy.ndarray] = None)[source]

A Dataset that loads data from image files on disk.

X

Get the X vector for this dataset as a single numpy array.

__init__(X: Union[numpy.ndarray, List[str]], y: Union[numpy.ndarray, List[str], None], w: Optional[numpy.ndarray] = None, ids: Optional[numpy.ndarray] = None) → None[source]

Create a dataset whose X and/or y array is defined by image files on disk.

Parameters:
  • X (np.ndarray or List[str]) – The dataset’s input data. This may be either a single NumPy array directly containing the data, or a list containing the paths to image files.
  • y (np.ndarray or List[str]) – The dataset’s labels. This may be either a single NumPy array directly containing the data, or a list containing the paths to image files.
  • w (np.ndarray, optional (default None)) – A 1D or 2D array containing the weights for each sample or sample/task pair.
  • ids (np.ndarray, optional (default None)) – The sample IDs.
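A hedged sketch of constructing an ImageDataset from files on disk; the paths below are hypothetical, so this is shown in non-doctest form:

>> import numpy as np
>> import deepchem as dc
>> image_files = ["cell0.png", "cell1.png"]  # hypothetical image paths
>> labels = np.array([0.0, 1.0])
>> dataset = dc.data.ImageDataset(X=image_files, y=labels)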
static from_dataframe(df: pandas.core.frame.DataFrame, X: Union[str, Sequence[str], None] = None, y: Union[str, Sequence[str], None] = None, w: Union[str, Sequence[str], None] = None, ids: Optional[str] = None)[source]

Construct a Dataset from the contents of a pandas DataFrame.

Parameters:
  • df (pd.DataFrame) – The pandas DataFrame
  • X (str or List[str], optional (default None)) – The name of the column or columns containing the X array. If this is None, it will look for default column names that match those produced by to_dataframe().
  • y (str or List[str], optional (default None)) – The name of the column or columns containing the y array. If this is None, it will look for default column names that match those produced by to_dataframe().
  • w (str or List[str], optional (default None)) – The name of the column or columns containing the w array. If this is None, it will look for default column names that match those produced by to_dataframe().
  • ids (str, optional (default None)) – The name of the column containing the ids. If this is None, it will look for default column names that match those produced by to_dataframe().
get_shape() → Tuple[Tuple[int, ...], Tuple[int, ...], Tuple[int, ...], Tuple[int, ...]][source]

Get the shape of the dataset.

Returns four tuples, giving the shape of the X, y, w, and ids arrays.

get_statistics(X_stats: bool = True, y_stats: bool = True) → Tuple[float, ...][source]

Compute and return statistics of this dataset.

Uses self.itersamples() to compute means and standard deviations of the dataset. Can compute on large datasets that don’t fit in memory.

Parameters:
  • X_stats (bool, optional (default True)) – If True, compute feature-level mean and standard deviations.
  • y_stats (bool, optional (default True)) – If True, compute label-level mean and standard deviations.
Returns:

If X_stats == True, returns (X_means, X_stds). If y_stats == True, returns (y_means, y_stds). If both are true, returns (X_means, X_stds, y_means, y_stds).

Return type:

Tuple

get_task_names() → numpy.ndarray[source]

Get the names of the tasks associated with this dataset.

ids

Get the ids vector for this dataset as a single numpy array.

iterbatches(batch_size: Optional[int] = None, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False) → Iterator[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]][source]

Get an object that iterates over minibatches from the dataset.

Each minibatch is returned as a tuple of four numpy arrays: (X, y, w, ids).

Parameters:
  • batch_size (int, optional (default None)) – Number of elements in each batch.
  • epochs (int, default 1) – Number of epochs to walk over dataset.
  • deterministic (bool, default False) – If True, follow deterministic order.
  • pad_batches (bool, default False) – If True, pad each batch to batch_size.
Returns:

Generator which yields tuples of four numpy arrays (X, y, w, ids).

Return type:

Iterator[Batch]

itersamples() → Iterator[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]][source]

Get an object that iterates over the samples in the dataset.

Returns:Iterator which yields tuples of four numpy arrays (X, y, w, ids).
Return type:Iterator[Batch]
make_pytorch_dataset(epochs: int = 1, deterministic: bool = False, batch_size: Optional[int] = None)[source]

Create a torch.utils.data.IterableDataset that iterates over the data in this Dataset.

Each value returned by the Dataset’s iterator is a tuple of (X, y, w, id) containing the data for one batch, or for a single sample if batch_size is None.

Parameters:
  • epochs (int, default 1) – The number of times to iterate over the Dataset.
  • deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
  • batch_size (int, optional (default None)) – The number of samples to return in each batch. If None, each returned value is a single sample.
Returns:

torch.utils.data.IterableDataset that iterates over the data in this dataset.

Return type:

torch.utils.data.IterableDataset

Notes

This method requires PyTorch to be installed.

make_tf_dataset(batch_size: int = 100, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False)[source]

Create a tf.data.Dataset that iterates over the data in this Dataset.

Each value returned by the Dataset’s iterator is a tuple of (X, y, w) for one batch.

Parameters:
  • batch_size (int, default 100) – The number of samples to include in each batch.
  • epochs (int, default 1) – The number of times to iterate over the Dataset.
  • deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
  • pad_batches (bool, default False) – If True, batches are padded as necessary to make the size of each batch exactly equal batch_size.
Returns:

TensorFlow Dataset that iterates over the same data.

Return type:

tf.data.Dataset

Notes

This method requires TensorFlow to be installed.

select(indices: Sequence[int], select_dir: Optional[str] = None) → deepchem.data.datasets.ImageDataset[source]

Creates a new dataset from a selection of indices from self.

Parameters:
  • indices (Sequence) – List of indices to select.
  • select_dir (str, optional (default None)) – Used to provide same API as DiskDataset. Ignored since ImageDataset is purely in-memory.
Returns:

A selected ImageDataset object

Return type:

ImageDataset

to_dataframe() → pandas.core.frame.DataFrame[source]

Construct a pandas DataFrame containing the data from this Dataset.

Returns:A pandas DataFrame. If there is only a single feature per datapoint, the DataFrame will have a column “X”; otherwise it will have columns “X1,X2,…” for the features. Likewise, labels appear in a column “y” or in columns “y1,y2,…”, weights in a column “w” or in columns “w1,w2,…”, and identifiers in a column “ids”.
Return type:pd.DataFrame
transform(transformer: transformers.Transformer, **args) → deepchem.data.datasets.NumpyDataset[source]

Construct a new dataset by applying a transformation to every sample in this dataset.

The argument is a function that can be called as follows:

>> newx, newy, neww = fn(x, y, w)

It might be called only once with the whole dataset, or multiple times with different subsets of the data. Each time it is called, it should transform the samples and return the transformed data.

Parameters:transformer (dc.trans.Transformer) – The transformation to apply to each sample in the dataset
Returns:A newly constructed NumpyDataset object
Return type:NumpyDataset
w

Get the weight vector for this dataset as a single numpy array.

y

Get the y vector for this dataset as a single numpy array.