Datasets

DeepChem dc.data.Dataset objects are one of the core building blocks of DeepChem programs. Dataset objects hold representations of data for machine learning and are widely used throughout DeepChem.

Dataset

The dc.data.Dataset class is the abstract parent class for all datasets. This class should never be instantiated directly, but it provides a number of useful method implementations.

The goal of the Dataset class is to be maximally interoperable with other common representations of machine learning datasets. For this reason, we provide interconversion methods mapping Dataset objects to pandas DataFrames, TensorFlow Datasets, and PyTorch Datasets.
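
For illustration, a minimal sketch of these interconversions (assuming a working DeepChem installation with TensorFlow and PyTorch available; the arrays are arbitrary):

import numpy as np
import deepchem as dc

# build a small in-memory dataset with random values (arbitrary data)
dataset = dc.data.NumpyDataset(X=np.random.rand(4, 3), y=np.random.rand(4, 1))

# to and from a pandas DataFrame
df = dataset.to_dataframe()
roundtrip = dc.data.NumpyDataset.from_dataframe(df)

# to a tf.data.Dataset of (X, y, w) batches and a PyTorch IterableDataset
tf_ds = dataset.make_tf_dataset(batch_size=2)
torch_ds = dataset.make_pytorch_dataset()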

class deepchem.data.Dataset[source]

Abstract base class for datasets defined by X, y, w elements.

Dataset objects are used to store representations of a dataset as used in a machine learning task. Datasets contain features X, labels y, weights w and identifiers ids. Different subclasses of Dataset may choose to hold X, y, w, ids in memory or on disk.

The Dataset class attempts to provide strong interoperability with other machine learning representations of datasets. Interconversion methods allow Dataset objects to be converted to and from numpy arrays, pandas DataFrames, TensorFlow datasets, and PyTorch datasets (for PyTorch, only conversion to is currently supported, not from).

Note that you can never instantiate a Dataset object directly. Instead you will need to instantiate one of the concrete subclasses.

X

Get the X vector for this dataset as a single numpy array.

Returns:
Return type: Numpy array of features X.

Note

If data is stored on disk, accessing this field may involve loading data from disk and could potentially be slow. Using iterbatches() or itersamples() may be more efficient for larger datasets.

__init__()[source]

Initialize self. See help(type(self)) for accurate signature.

static from_dataframe(df, X=None, y=None, w=None, ids=None)[source]

Construct a Dataset from the contents of a pandas DataFrame.

Parameters:
  • df (DataFrame) – the pandas DataFrame
  • X (string or list of strings) – the name of the column or columns containing the X array. If this is None, it will look for default column names that match those produced by to_dataframe().
  • y (string or list of strings) – the name of the column or columns containing the y array. If this is None, it will look for default column names that match those produced by to_dataframe().
  • w (string or list of strings) – the name of the column or columns containing the w array. If this is None, it will look for default column names that match those produced by to_dataframe().
  • ids (string) – the name of the column containing the ids. If this is None, it will look for default column names that match those produced by to_dataframe().
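
As a sketch, constructing a dataset from a DataFrame with explicit column names (the DataFrame and its column names here are purely illustrative):

import pandas as pd
import deepchem as dc

# illustrative DataFrame; the column names are hypothetical
df = pd.DataFrame({"feat1": [1.0, 2.0], "feat2": [3.0, 4.0],
                   "label": [0.0, 1.0], "mol_id": ["a", "b"]})
dataset = dc.data.NumpyDataset.from_dataframe(
    df, X=["feat1", "feat2"], y="label", ids="mol_id")
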
get_shape()[source]

Get the shape of the dataset.

Returns a tuple of four tuples giving the shapes of the X, y, w, and ids arrays.

get_statistics(X_stats=True, y_stats=True)[source]

Compute and return statistics of this dataset.

Uses self.itersamples() to compute means and standard deviations of the dataset. Can compute on large datasets that don’t fit in memory.

Parameters:
  • X_stats (bool, optional) – If True, compute feature-level mean and standard deviations.
  • y_stats (bool, optional) – If True, compute label-level mean and standard deviations.
Returns:

If X_stats == True, returns (X_means, X_stds). If y_stats == True, returns (y_means, y_stds). If both are True, returns (X_means, X_stds, y_means, y_stds).
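
A short usage sketch (arbitrary random data):

import numpy as np
import deepchem as dc

dataset = dc.data.NumpyDataset(X=np.random.rand(10, 3), y=np.random.rand(10, 1))
# with both flags True (the default), four values are returned
X_means, X_stds, y_means, y_stds = dataset.get_statistics()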

get_task_names()[source]

Get the names of the tasks associated with this dataset.

ids

Get the ids vector for this dataset as a single numpy array.

Returns:
Return type: Numpy array of identifiers ids.

Note

If data is stored on disk, accessing this field may involve loading data from disk and could potentially be slow. Using iterbatches() or itersamples() may be more efficient for larger datasets.

iterbatches(batch_size=None, epochs=1, deterministic=False, pad_batches=False)[source]

Get an object that iterates over minibatches from the dataset.

Each minibatch is returned as a tuple of four numpy arrays: (X, y, w, ids).

Parameters:
  • batch_size (int, optional) – Number of elements in each batch
  • epochs (int, optional) – Number of epochs to walk over dataset
  • deterministic (bool, optional) – If True, follow deterministic order.
  • pad_batches (bool, optional) – If True, pad each batch to batch_size.
Returns:

Return type:

Generator which yields tuples of four numpy arrays (X, y, w, ids)
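
For example, a minimal sketch of a batch loop:

import numpy as np
import deepchem as dc

dataset = dc.data.NumpyDataset(X=np.random.rand(10, 3), y=np.random.rand(10, 1))
# two passes over the data in batches of four, padded to a uniform size
for X_b, y_b, w_b, ids_b in dataset.iterbatches(batch_size=4, epochs=2,
                                                pad_batches=True):
    assert X_b.shape[0] == 4  # every batch is padded to exactly batch_size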

itersamples()[source]

Get an object that iterates over the samples in the dataset.

Example:

>>> import numpy as np
>>> from deepchem.data import NumpyDataset
>>> dataset = NumpyDataset(np.ones((2,2)))
>>> for x, y, w, id in dataset.itersamples():
...   print(x.tolist(), y.tolist(), w.tolist(), id)
[1.0, 1.0] [0.0] [0.0] 0
[1.0, 1.0] [0.0] [0.0] 1

make_pytorch_dataset(epochs=1, deterministic=False)[source]

Create a torch.utils.data.IterableDataset that iterates over the data in this Dataset.

Each value returned by the Dataset’s iterator is a tuple of (X, y, w, id) for one sample.

Parameters:
  • epochs (int) – the number of times to iterate over the Dataset
  • deterministic (bool) – if True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
Returns:

Return type:

torch.utils.data.IterableDataset that iterates over the data in this dataset.
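
A minimal consumption sketch (assuming PyTorch is installed); each item is one (X, y, w, id) sample:

import numpy as np
import deepchem as dc

dataset = dc.data.NumpyDataset(X=np.random.rand(4, 3), y=np.random.rand(4, 1))
torch_ds = dataset.make_pytorch_dataset(epochs=1, deterministic=True)
# the IterableDataset can be consumed directly, one sample per step
for x, y, w, sample_id in torch_ds:
    print(x.shape, sample_id)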

make_tf_dataset(batch_size=100, epochs=1, deterministic=False, pad_batches=False)[source]

Create a tf.data.Dataset that iterates over the data in this Dataset.

Each value returned by the Dataset’s iterator is a tuple of (X, y, w) for one batch.

Parameters:
  • batch_size (int) – the number of samples to include in each batch
  • epochs (int) – the number of times to iterate over the Dataset
  • deterministic (bool) – if True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
  • pad_batches (bool) – if True, batches are padded as necessary to make the size of each batch exactly equal batch_size.
Returns:

Return type:

tf.data.Dataset that iterates over the same data.
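
A brief sketch (assuming TensorFlow is installed); each element of the returned dataset is one (X, y, w) batch:

import numpy as np
import deepchem as dc

dataset = dc.data.NumpyDataset(X=np.random.rand(10, 3), y=np.random.rand(10, 1))
tf_ds = dataset.make_tf_dataset(batch_size=4, epochs=1, pad_batches=True)
for X_b, y_b, w_b in tf_ds:
    print(X_b.shape)  # (4, 3) for every batch, since pad_batches=True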

to_dataframe()[source]

Construct a pandas DataFrame containing the data from this Dataset.

Returns:

pandas DataFrame. If there is only a single feature per datapoint, it will have the column “X”; otherwise columns “X1, X2, …” for features. If there is only a single label per datapoint, it will have the column “y”; otherwise columns “y1, y2, …” for labels. If there is only a single weight per datapoint, it will have the column “w”; otherwise columns “w1, w2, …”. It will have the column “ids” for identifiers.

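For example, with three features and one label per datapoint, the expected columns would be:

import numpy as np
import deepchem as dc

dataset = dc.data.NumpyDataset(X=np.ones((2, 3)), y=np.zeros((2, 1)))
df = dataset.to_dataframe()
print(list(df.columns))  # expected: ['X1', 'X2', 'X3', 'y', 'w', 'ids']
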
transform(fn, **args)[source]

Construct a new dataset by applying a transformation to every sample in this dataset.

The argument is a function that can be called as follows:

>> newx, newy, neww = fn(x, y, w)

It might be called only once with the whole dataset, or multiple times with different subsets of the data. Each time it is called, it should transform the samples and return the transformed data.

Parameters: fn (function) – A function to apply to each sample in the dataset
Returns:
Return type: a newly constructed Dataset object

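A sketch of a simple transformation that centers the features and passes labels and weights through unchanged:

import numpy as np
import deepchem as dc

dataset = dc.data.NumpyDataset(X=np.random.rand(5, 3), y=np.random.rand(5, 1))

def center(x, y, w):
    # shift features to zero mean; labels and weights pass through unchanged
    return x - x.mean(axis=0), y, w

centered = dataset.transform(center)
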
w

Get the weight vector for this dataset as a single numpy array.

Returns:
Return type: Numpy array of weights w.

Note

If data is stored on disk, accessing this field may involve loading data from disk and could potentially be slow. Using iterbatches() or itersamples() may be more efficient for larger datasets.

y

Get the y vector for this dataset as a single numpy array.

Returns:
Return type: Numpy array of labels y.

Note

If data is stored on disk, accessing this field may involve loading data from disk and could potentially be slow. Using iterbatches() or itersamples() may be more efficient for larger datasets.

NumpyDataset

The dc.data.NumpyDataset class provides an in-memory implementation of the abstract Dataset which stores its data in numpy.ndarray objects.

class deepchem.data.NumpyDataset(X, y=None, w=None, ids=None, n_tasks=1)[source]

A Dataset defined by in-memory numpy arrays.

This subclass of Dataset stores the arrays X, y, w, ids in memory as numpy arrays. This makes it very easy to construct NumpyDataset objects. For example:

>>> import numpy as np
>>> from deepchem.data import NumpyDataset
>>> dataset = NumpyDataset(X=np.random.rand(5, 3), y=np.random.rand(5,), ids=np.arange(5))

X

Get the X vector for this dataset as a single numpy array.

__init__(X, y=None, w=None, ids=None, n_tasks=1)[source]

Initialize this object.

Parameters:
  • X (np.ndarray) – Input features. Of shape (n_samples,…)
  • y (np.ndarray, optional) – Labels. Of shape (n_samples, …). Note that each label can have an arbitrary shape.
  • w (np.ndarray, optional) – Weights. Should either be 1D of shape (n_samples,) or if there’s more than one task, of shape (n_samples, n_tasks).
  • ids (np.ndarray, optional) – Identifiers. Of shape (n_samples,)
  • n_tasks (int, optional) – Number of learning tasks.
static from_DiskDataset(ds)[source]

Construct a NumpyDataset from the data in a DiskDataset.

Parameters: ds (deepchem.data.DiskDataset) – the DiskDataset to load into memory
Returns:

Data of ds as NumpyDataset

Return type:

NumpyDataset

static from_dataframe(df, X=None, y=None, w=None, ids=None)[source]

Construct a Dataset from the contents of a pandas DataFrame.

Parameters:
  • df (DataFrame) – the pandas DataFrame
  • X (string or list of strings) – the name of the column or columns containing the X array. If this is None, it will look for default column names that match those produced by to_dataframe().
  • y (string or list of strings) – the name of the column or columns containing the y array. If this is None, it will look for default column names that match those produced by to_dataframe().
  • w (string or list of strings) – the name of the column or columns containing the w array. If this is None, it will look for default column names that match those produced by to_dataframe().
  • ids (string) – the name of the column containing the ids. If this is None, it will look for default column names that match those produced by to_dataframe().
get_shape()[source]

Get the shape of the dataset.

Returns a tuple of four tuples giving the shapes of the X, y, w, and ids arrays.

get_statistics(X_stats=True, y_stats=True)[source]

Compute and return statistics of this dataset.

Uses self.itersamples() to compute means and standard deviations of the dataset. Can compute on large datasets that don’t fit in memory.

Parameters:
  • X_stats (bool, optional) – If True, compute feature-level mean and standard deviations.
  • y_stats (bool, optional) – If True, compute label-level mean and standard deviations.
Returns:

If X_stats == True, returns (X_means, X_stds). If y_stats == True, returns (y_means, y_stds). If both are True, returns (X_means, X_stds, y_means, y_stds).

get_task_names()[source]

Get the names of the tasks associated with this dataset.

ids

Get the ids vector for this dataset as a single numpy array.

iterbatches(batch_size=None, epochs=1, deterministic=False, pad_batches=False)[source]

Get an object that iterates over minibatches from the dataset.

Each minibatch is returned as a tuple of four numpy arrays: (X, y, w, ids).

Parameters:
  • batch_size (int, optional) – Number of elements in each batch
  • epochs (int, optional) – Number of epochs to walk over dataset
  • deterministic (bool, optional) – If True, follow deterministic order.
  • pad_batches (bool, optional) – If True, pad each batch to batch_size.
Returns:

Return type:

Generator which yields tuples of four numpy arrays (X, y, w, ids)

itersamples()[source]

Get an object that iterates over the samples in the dataset.

Example:

>>> import numpy as np
>>> from deepchem.data import NumpyDataset
>>> dataset = NumpyDataset(np.ones((2,2)))
>>> for x, y, w, id in dataset.itersamples():
...   print(x.tolist(), y.tolist(), w.tolist(), id)
[1.0, 1.0] [0.0] [0.0] 0
[1.0, 1.0] [0.0] [0.0] 1

make_pytorch_dataset(epochs=1, deterministic=False)[source]

Create a torch.utils.data.IterableDataset that iterates over the data in this Dataset.

Each value returned by the Dataset’s iterator is a tuple of (X, y, w, id) for one sample.

Parameters:
  • epochs (int) – the number of times to iterate over the Dataset
  • deterministic (bool) – if True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
make_tf_dataset(batch_size=100, epochs=1, deterministic=False, pad_batches=False)[source]

Create a tf.data.Dataset that iterates over the data in this Dataset.

Each value returned by the Dataset’s iterator is a tuple of (X, y, w) for one batch.

Parameters:
  • batch_size (int) – the number of samples to include in each batch
  • epochs (int) – the number of times to iterate over the Dataset
  • deterministic (bool) – if True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
  • pad_batches (bool) – if True, batches are padded as necessary to make the size of each batch exactly equal batch_size.
Returns:

Return type:

tf.data.Dataset that iterates over the same data.

static merge(datasets)[source]
Parameters: datasets (list of deepchem.data.NumpyDataset) – list of datasets to merge
Returns:
Return type: Single deepchem.data.NumpyDataset with data concatenated over axis 0

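A short sketch:

import numpy as np
import deepchem as dc

d1 = dc.data.NumpyDataset(X=np.ones((2, 3)), y=np.zeros((2, 1)))
d2 = dc.data.NumpyDataset(X=np.zeros((3, 3)), y=np.ones((3, 1)))
merged = dc.data.NumpyDataset.merge([d1, d2])
print(merged.X.shape)  # (5, 3): rows concatenated over axis 0
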
select(indices, select_dir=None)[source]

Creates a new dataset from a selection of indices from self.

Parameters:
  • indices (list) – List of indices to select.
  • select_dir (string) – Used to provide same API as DiskDataset. Ignored since NumpyDataset is purely in-memory.
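
For example, selecting the first and last rows:

import numpy as np
import deepchem as dc

dataset = dc.data.NumpyDataset(X=np.arange(12).reshape(4, 3))
subset = dataset.select([0, 3])  # keep only the first and last samples
print(subset.X.shape)  # (2, 3)
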
to_dataframe()[source]

Construct a pandas DataFrame containing the data from this Dataset.

Returns:

pandas DataFrame. If there is only a single feature per datapoint, it will have the column “X”; otherwise columns “X1, X2, …” for features. If there is only a single label per datapoint, it will have the column “y”; otherwise columns “y1, y2, …” for labels. If there is only a single weight per datapoint, it will have the column “w”; otherwise columns “w1, w2, …”. It will have the column “ids” for identifiers.

transform(fn, **args)[source]

Construct a new dataset by applying a transformation to every sample in this dataset.

The argument is a function that can be called as follows:

>> newx, newy, neww = fn(x, y, w)

It might be called only once with the whole dataset, or multiple times with different subsets of the data. Each time it is called, it should transform the samples and return the transformed data.

Parameters: fn (function) – A function to apply to each sample in the dataset
Returns:
Return type: a newly constructed Dataset object

w

Get the weight vector for this dataset as a single numpy array.

y

Get the y vector for this dataset as a single numpy array.

DiskDataset

The dc.data.DiskDataset class allows for the storage of larger datasets on disk. Each DiskDataset is associated with a directory in which it writes its contents to disk. Note that a DiskDataset can be very large, so some of the utility methods to access fields of a Dataset can be prohibitively expensive.

class deepchem.data.DiskDataset(data_dir)[source]

A Dataset that is stored as a set of files on disk.

X

Get the X vector for this dataset as a single numpy array.

__init__(data_dir)[source]

Turns featurized DataFrames into numpy files and writes them, along with metadata, to disk.

add_shard(X, y, w, ids)[source]

Adds a data shard.

complete_shuffle(data_dir=None)[source]

Completely shuffle across all data, across all shards.

Note: this loads all the data into RAM and can be prohibitively expensive for larger datasets.

Parameters: data_dir (string, optional) – directory in which to write the shuffled dataset. If None, a temporary directory is used.
Returns: A DiskDataset with a single shard.
Return type: DiskDataset
static create_dataset(shard_generator, data_dir=None, tasks=[])[source]

Creates a new DiskDataset.

Parameters:
  • shard_generator (Iterable) – An iterable (either a list or generator) that provides tuples of data (X, y, w, ids). Each tuple will be written to a separate shard on disk.
  • data_dir (str) – Filename for data directory. Creates a temp directory if none specified.
  • tasks (list) – List of tasks for this dataset.
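
A sketch using a generator that yields two shards (arbitrary random data):

import numpy as np
import deepchem as dc

def shard_generator():
    # each yielded (X, y, w, ids) tuple becomes one shard on disk
    for start in (0, 3):
        X = np.random.rand(3, 4)
        y = np.random.rand(3, 1)
        w = np.ones((3, 1))
        ids = np.arange(start, start + 3)
        yield X, y, w, ids

dataset = dc.data.DiskDataset.create_dataset(shard_generator(), tasks=["task0"])
print(dataset.get_number_shards())  # 2
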
static from_dataframe(df, X=None, y=None, w=None, ids=None)[source]

Construct a Dataset from the contents of a pandas DataFrame.

Parameters:
  • df (DataFrame) – the pandas DataFrame
  • X (string or list of strings) – the name of the column or columns containing the X array. If this is None, it will look for default column names that match those produced by to_dataframe().
  • y (string or list of strings) – the name of the column or columns containing the y array. If this is None, it will look for default column names that match those produced by to_dataframe().
  • w (string or list of strings) – the name of the column or columns containing the w array. If this is None, it will look for default column names that match those produced by to_dataframe().
  • ids (string) – the name of the column containing the ids. If this is None, it will look for default column names that match those produced by to_dataframe().
static from_numpy(X, y=None, w=None, ids=None, tasks=None, data_dir=None)[source]

Creates a DiskDataset object from specified Numpy arrays.
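
A minimal sketch (omitting data_dir writes the shards to a temporary directory):

import numpy as np
import deepchem as dc

dataset = dc.data.DiskDataset.from_numpy(X=np.random.rand(4, 3),
                                         y=np.random.rand(4, 1))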

get_data_shape()[source]

Gets array shape of datapoints in this dataset.

get_label_means()[source]

Return pandas series of label means.

get_label_stds()[source]

Return pandas series of label stds.

get_number_shards()[source]

Returns the number of shards for this dataset.

get_shape()[source]

Finds shape of dataset.

get_shard(i)[source]

Retrieves data for the i-th shard from disk.

get_shard_ids(i)[source]

Retrieves the list of IDs for the i-th shard from disk.

get_shard_size()[source]

Gets size of shards on disk.

get_shard_w(i)[source]

Retrieves the weights for the i-th shard from disk.

Parameters: i (int) – Shard index for shard to retrieve weights from
get_shard_y(i)[source]

Retrieves the labels for the i-th shard from disk.

Parameters: i (int) – Shard index for shard to retrieve labels from
get_statistics(X_stats=True, y_stats=True)[source]

Compute and return statistics of this dataset.

Uses self.itersamples() to compute means and standard deviations of the dataset. Can compute on large datasets that don’t fit in memory.

Parameters:
  • X_stats (bool, optional) – If True, compute feature-level mean and standard deviations.
  • y_stats (bool, optional) – If True, compute label-level mean and standard deviations.
Returns:

If X_stats == True, returns (X_means, X_stds). If y_stats == True, returns (y_means, y_stds). If both are True, returns (X_means, X_stds, y_means, y_stds).

get_task_names()[source]

Gets learning tasks associated with this dataset.

ids

Get the ids vector for this dataset as a single numpy array.

iterbatches(batch_size=None, epochs=1, deterministic=False, pad_batches=False)[source]

Get an object that iterates over minibatches from the dataset.

It is guaranteed that the number of batches returned is math.ceil(len(dataset)/batch_size). Each minibatch is returned as a tuple of four numpy arrays: (X, y, w, ids).

Parameters:
  • batch_size (int) – Number of elements in a batch. If None, then it yields batches with size equal to the size of each individual shard.
  • epochs (int) – Number of epochs to walk over dataset
  • deterministic (bool) – If True, produce batches in a fixed order; otherwise each shard is shuffled before its batches are generated. Note that this shuffling is only local: samples are never mixed between different shards.
  • pad_batches (bool) – Whether or not we should pad the last batch, globally, such that it has exactly batch_size elements.
itersamples()[source]

Get an object that iterates over the samples in the dataset.

Example:

>>> import numpy as np
>>> from deepchem.data import DiskDataset
>>> dataset = DiskDataset.from_numpy(np.ones((2,2)), np.ones((2,1)))
>>> for x, y, w, id in dataset.itersamples():
...   print(x.tolist(), y.tolist(), w.tolist(), id)
[1.0, 1.0] [1.0] [1.0] 0
[1.0, 1.0] [1.0] [1.0] 1

itershards()[source]

Return an object that iterates over all shards in dataset.

Datasets are stored in sharded fashion on disk. Each call to next() for the generator defined by this function returns the data from a particular shard. The order of shards returned is guaranteed to remain fixed.
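
A sketch that walks the shards one at a time (reshard is used here only to force multiple shards):

import numpy as np
import deepchem as dc

dataset = dc.data.DiskDataset.from_numpy(np.random.rand(10, 3),
                                         np.random.rand(10, 1))
dataset.reshard(shard_size=4)
# shard sizes are inspected without loading the full dataset at once
for X, y, w, ids in dataset.itershards():
    print(X.shape[0])  # 4, 4, 2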

make_pytorch_dataset(epochs=1, deterministic=False)[source]

Create a torch.utils.data.IterableDataset that iterates over the data in this Dataset.

Each value returned by the Dataset’s iterator is a tuple of (X, y, w, id) for one sample.

Parameters:
  • epochs (int) – the number of times to iterate over the Dataset
  • deterministic (bool) – if True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
make_tf_dataset(batch_size=100, epochs=1, deterministic=False, pad_batches=False)[source]

Create a tf.data.Dataset that iterates over the data in this Dataset.

Each value returned by the Dataset’s iterator is a tuple of (X, y, w) for one batch.

Parameters:
  • batch_size (int) – the number of samples to include in each batch
  • epochs (int) – the number of times to iterate over the Dataset
  • deterministic (bool) – if True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
  • pad_batches (bool) – if True, batches are padded as necessary to make the size of each batch exactly equal batch_size.
Returns:

Return type:

tf.data.Dataset that iterates over the same data.

memory_cache_size

Get the size of the memory cache for this dataset, measured in bytes.

static merge(datasets, merge_dir=None)[source]

Merges provided datasets into a merged dataset.

move(new_data_dir)[source]

Moves dataset to new directory.

reshard(shard_size)[source]

Reshards data to have specified shard size.

save_to_disk()[source]

Save dataset to disk.

select(indices, select_dir=None)[source]

Creates a new dataset from a selection of indices from self.

Parameters:
  • indices (list) – List of indices to select.
  • select_dir (string) – Path to new directory that the selected indices will be copied to.
set_shard(shard_num, X, y, w, ids)[source]

Writes data shard to disk.

shuffle_each_shard()[source]

Shuffles elements within each shard of the dataset.

shuffle_shards()[source]

Shuffles the order of the shards for this dataset.

sparse_shuffle()[source]

Shuffling that exploits data sparsity to shuffle large datasets.

Only for 1-dimensional feature vectors (does not work for tensorial featurizations).

subset(shard_nums, subset_dir=None)[source]

Creates a subset of the original dataset on disk.

to_dataframe()[source]

Construct a pandas DataFrame containing the data from this Dataset.

Returns:

pandas DataFrame. If there is only a single feature per datapoint, it will have the column “X”; otherwise columns “X1, X2, …” for features. If there is only a single label per datapoint, it will have the column “y”; otherwise columns “y1, y2, …” for labels. If there is only a single weight per datapoint, it will have the column “w”; otherwise columns “w1, w2, …”. It will have the column “ids” for identifiers.

transform(fn, **args)[source]

Construct a new dataset by applying a transformation to every sample in this dataset.

The argument is a function that can be called as follows:

>> newx, newy, neww = fn(x, y, w)

It might be called only once with the whole dataset, or multiple times with different subsets of the data. Each time it is called, it should transform the samples and return the transformed data.

Parameters:
  • fn (function) – A function to apply to each sample in the dataset
  • out_dir (string) – The directory to save the new dataset in. If this is omitted, a temporary directory is created automatically
Returns:

Return type:

a newly constructed Dataset object

w

Get the weight vector for this dataset as a single numpy array.

y

Get the y vector for this dataset as a single numpy array.

ImageDataset

The dc.data.ImageDataset class is optimized to allow for convenient processing of image-based datasets.

class deepchem.data.ImageDataset(X, y, w=None, ids=None)[source]

A Dataset that loads data from image files on disk.

X

Get the X vector for this dataset as a single numpy array.

__init__(X, y, w=None, ids=None)[source]

Create a dataset whose X and/or y array is defined by image files on disk.

Parameters:
  • X (ndarray or list of strings) – The dataset’s input data. This may be either a single NumPy array directly containing the data, or a list containing the paths to the image files
  • y (ndarray or list of strings) – The dataset’s labels. This may be either a single NumPy array directly containing the data, or a list containing the paths to the image files
  • w (ndarray) – a 1D or 2D array containing the weights for each sample or sample/task pair
  • ids (ndarray) – the sample IDs
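
A sketch of constructing a dataset whose X array is defined by image files (the paths here are hypothetical and must point to real images):

import numpy as np
import deepchem as dc

image_files = ["img0.png", "img1.png"]  # hypothetical paths to existing images
labels = np.array([0.0, 1.0])
dataset = dc.data.ImageDataset(X=image_files, y=labels)
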
static from_dataframe(df, X=None, y=None, w=None, ids=None)[source]

Construct a Dataset from the contents of a pandas DataFrame.

Parameters:
  • df (DataFrame) – the pandas DataFrame
  • X (string or list of strings) – the name of the column or columns containing the X array. If this is None, it will look for default column names that match those produced by to_dataframe().
  • y (string or list of strings) – the name of the column or columns containing the y array. If this is None, it will look for default column names that match those produced by to_dataframe().
  • w (string or list of strings) – the name of the column or columns containing the w array. If this is None, it will look for default column names that match those produced by to_dataframe().
  • ids (string) – the name of the column containing the ids. If this is None, it will look for default column names that match those produced by to_dataframe().
get_shape()[source]

Get the shape of the dataset.

Returns a tuple of four tuples giving the shapes of the X, y, w, and ids arrays.

get_statistics(X_stats=True, y_stats=True)[source]

Compute and return statistics of this dataset.

Uses self.itersamples() to compute means and standard deviations of the dataset. Can compute on large datasets that don’t fit in memory.

Parameters:
  • X_stats (bool, optional) – If True, compute feature-level mean and standard deviations.
  • y_stats (bool, optional) – If True, compute label-level mean and standard deviations.
Returns:

If X_stats == True, returns (X_means, X_stds). If y_stats == True, returns (y_means, y_stds). If both are True, returns (X_means, X_stds, y_means, y_stds).

get_task_names()[source]

Get the names of the tasks associated with this dataset.

ids

Get the ids vector for this dataset as a single numpy array.

iterbatches(batch_size=None, epochs=1, deterministic=False, pad_batches=False)[source]

Get an object that iterates over minibatches from the dataset.

Each minibatch is returned as a tuple of four numpy arrays: (X, y, w, ids).

itersamples()[source]

Get an object that iterates over the samples in the dataset.

Example:

>>> import numpy as np
>>> from deepchem.data import NumpyDataset
>>> dataset = NumpyDataset(np.ones((2,2)))
>>> for x, y, w, id in dataset.itersamples():
...   print(x.tolist(), y.tolist(), w.tolist(), id)
[1.0, 1.0] [0.0] [0.0] 0
[1.0, 1.0] [0.0] [0.0] 1

make_pytorch_dataset(epochs=1, deterministic=False)[source]

Create a torch.utils.data.IterableDataset that iterates over the data in this Dataset.

Each value returned by the Dataset’s iterator is a tuple of (X, y, w, id) for one sample.

Parameters:
  • epochs (int) – the number of times to iterate over the Dataset
  • deterministic (bool) – if True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
Returns:

Return type:

torch.utils.data.IterableDataset iterating over the same data as this dataset.

make_tf_dataset(batch_size=100, epochs=1, deterministic=False, pad_batches=False)[source]

Create a tf.data.Dataset that iterates over the data in this Dataset.

Each value returned by the Dataset’s iterator is a tuple of (X, y, w) for one batch.

Parameters:
  • batch_size (int) – the number of samples to include in each batch
  • epochs (int) – the number of times to iterate over the Dataset
  • deterministic (bool) – if True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
  • pad_batches (bool) – if True, batches are padded as necessary to make the size of each batch exactly equal batch_size.
Returns:

Return type:

tf.data.Dataset that iterates over the same data.

select(indices, select_dir=None)[source]

Creates a new dataset from a selection of indices from self.

Parameters:
  • indices (list) – List of indices to select.
  • select_dir (string) – Used to provide same API as DiskDataset. Ignored since ImageDataset is purely in-memory.
to_dataframe()[source]

Construct a pandas DataFrame containing the data from this Dataset.

Returns:

pandas DataFrame. If there is only a single feature per datapoint, it will have the column “X”; otherwise columns “X1, X2, …” for features. If there is only a single label per datapoint, it will have the column “y”; otherwise columns “y1, y2, …” for labels. If there is only a single weight per datapoint, it will have the column “w”; otherwise columns “w1, w2, …”. It will have the column “ids” for identifiers.

transform(fn, **args)[source]

Construct a new dataset by applying a transformation to every sample in this dataset.

The argument is a function that can be called as follows:

>> newx, newy, neww = fn(x, y, w)

It might be called only once with the whole dataset, or multiple times with different subsets of the data. Each time it is called, it should transform the samples and return the transformed data.

Parameters: fn (function) – A function to apply to each sample in the dataset
Returns:
Return type: a newly constructed Dataset object

w

Get the weight vector for this dataset as a single numpy array.

y

Get the y vector for this dataset as a single numpy array.