Data Loaders

Processing large amounts of input data to construct a dc.data.Dataset object can require some amount of hacking. To simplify this process for you, you can use the dc.data.DataLoader classes. These classes provide utilities for you to load and process large amounts of data.

DataLoader

class deepchem.data.DataLoader(tasks, id_field=None, featurizer=None, log_every_n=1000)[source]

Handles loading/featurizing of data from disk.

The main use of DataLoader and its child classes is to make it easier to load large datasets into Dataset objects.

DataLoader is an abstract superclass that provides a general framework for loading data into DeepChem. This class should never be instantiated directly. To load your own type of data, make a subclass of DataLoader and provide your own implementation for the create_dataset() method.
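
As a rough illustration of the subclassing pattern, the sketch below defines a hypothetical loader that featurizes a plain Python list of inputs into an in-memory dataset. The class name and the choice to return a NumpyDataset are illustrative assumptions, not part of the DeepChem API.

    import numpy as np
    import deepchem as dc

    class InMemoryListLoader(dc.data.DataLoader):
        """Hypothetical loader that featurizes a plain Python list of inputs."""

        def create_dataset(self, inputs, data_dir=None, shard_size=8192):
            # Featurize every input with the featurizer supplied to __init__.
            X = self.featurizer.featurize(inputs)
            # No labels are available here, so fill y with zeros (one column per task).
            y = np.zeros((len(inputs), len(self.tasks)))
            return dc.data.NumpyDataset(X=X, y=y, ids=np.array(inputs))

Such a loader could then be constructed like any other, e.g. InMemoryListLoader(tasks=["task"], featurizer=dc.feat.CircularFingerprint()), and handed a list of SMILES strings.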

To construct a Dataset from input data, first instantiate a concrete data loader (that is, an object which is an instance of a subclass of DataLoader) with a given Featurizer object. Then call the data loader’s create_dataset() method on a list of input files that hold the source data to process. Note that each subclass of DataLoader is specialized to handle one type of input data so you will have to pick the loader class suitable for your input data type.

Note that it isn’t necessary to use a data loader to process input data. You can directly use Featurizer objects to featurize provided input into numpy arrays, but note that this calculation will be performed in memory, so you will have to write generators that walk the source files and write featurized data to disk yourself. DataLoader and its subclasses make this process easier for you by performing this work under the hood.
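
For example, the following minimal sketch (the SMILES strings and labels are made up) featurizes a few molecules directly with a Featurizer and wraps the resulting arrays in a NumpyDataset, which is exactly the in-memory route described above:

    import numpy as np
    import deepchem as dc

    # Featurize a handful of molecules directly, without a DataLoader.
    featurizer = dc.feat.CircularFingerprint(size=1024)
    smiles = ["CCO", "CCC", "c1ccccc1"]
    X = featurizer.featurize(smiles)      # numpy array, one row per molecule
    y = np.array([[0.0], [1.0], [0.0]])   # made-up labels for illustration
    dataset = dc.data.NumpyDataset(X=X, y=y, ids=np.array(smiles))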

__init__(tasks, id_field=None, featurizer=None, log_every_n=1000)[source]

Construct a DataLoader object.

This constructor is provided mainly as a template. As a user, you should never call it directly; instantiate a concrete subclass instead.

Parameters:
  • tasks (list[str]) – List of task names
  • id_field (str, optional) – Name of field that holds sample identifier. Note that the meaning of “field” depends on the input data type and can have a different meaning in different subclasses. For example, a CSV file could have a field as a column, and an SDF file could have a field as molecular property.
  • featurizer (dc.feat.Featurizer, optional) – Featurizer to use to process data
  • log_every_n (int, optional) – Writes a logging statement this often.
create_dataset(input_files, data_dir=None, shard_size=8192)[source]

Creates and returns a Dataset object by featurizing provided files.

Reads in input_files and uses self.featurizer to featurize the data in these input files. For large files, automatically shards into smaller chunks of shard_size datapoints for convenience. Returns a Dataset object that contains the featurized dataset.

This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.

Parameters:
  • input_files (list) – List of input filenames.
  • data_dir (str, optional) – Directory to store featurized dataset.
  • shard_size (int, optional) – Number of examples stored in each shard.
Returns:
  A Dataset object containing a featurized representation of the data from input_files.

featurize(input_files, data_dir=None, shard_size=8192)[source]

Featurize provided files and write to specified location.

DEPRECATED: This method is now a wrapper for create_dataset() and calls that method under the hood.

For large datasets, automatically shards into smaller chunks for convenience. This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.

Parameters:
  • input_files (list) – List of input filenames.
  • data_dir (str, optional) – Directory to store featurized dataset.
  • shard_size (int, optional) – Number of examples stored in each shard.
Returns:
  A Dataset object containing a featurized representation of the data from input_files.

CSVLoader

class deepchem.data.CSVLoader(tasks, smiles_field=None, id_field=None, featurizer=None, log_every_n=1000)[source]

Creates Dataset objects from input CSV files.

This class provides conveniences to load data from CSV files. It’s possible to directly featurize data from CSV files using pandas, but this class may prove useful if you’re processing large CSV files that you don’t want to manipulate directly in memory.
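
As a minimal sketch of typical usage (the file name and column names are assumptions about your data, not requirements of the API), a CSV with a SMILES column and a single task column could be loaded as follows:

    import deepchem as dc

    # Assumes a file "assay.csv" with columns "compound_id", "smiles", and "pIC50".
    loader = dc.data.CSVLoader(tasks=["pIC50"],
                               smiles_field="smiles",
                               id_field="compound_id",
                               featurizer=dc.feat.CircularFingerprint(size=1024))
    dataset = loader.create_dataset(["assay.csv"])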

__init__(tasks, smiles_field=None, id_field=None, featurizer=None, log_every_n=1000)[source]

Initializes CSVLoader.

Parameters:
  • tasks (list[str]) – List of task names
  • smiles_field (str, optional) – Name of field that holds the SMILES string
  • id_field (str, optional) – Name of field that holds sample identifier
  • featurizer (dc.feat.Featurizer, optional) – Featurizer to use to process data
  • log_every_n (int, optional) – Writes a logging statement this often.
create_dataset(input_files, data_dir=None, shard_size=8192)[source]

Creates and returns a Dataset object by featurizing provided files.

Reads in input_files and uses self.featurizer to featurize the data in these input files. For large files, automatically shards into smaller chunks of shard_size datapoints for convenience. Returns a Dataset object that contains the featurized dataset.

This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.

Parameters:
  • input_files (list) – List of input filenames.
  • data_dir (str, optional) – Directory to store featurized dataset.
  • shard_size (int, optional) – Number of examples stored in each shard.
Returns:
  A Dataset object containing a featurized representation of the data from input_files.

featurize(input_files, data_dir=None, shard_size=8192)[source]

Featurize provided files and write to specified location.

DEPRECATED: This method is now a wrapper for create_dataset() and calls that method under the hood.

For large datasets, automatically shards into smaller chunks for convenience. This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.

Parameters:
  • input_files (list) – List of input filenames.
  • data_dir (str, optional) – Directory to store featurized dataset.
  • shard_size (int, optional) – Number of examples stored in each shard.
Returns:
  A Dataset object containing a featurized representation of the data from input_files.

UserCSVLoader

class deepchem.data.UserCSVLoader(tasks, smiles_field=None, id_field=None, featurizer=None, log_every_n=1000)[source]

Handles loading of CSV files containing user-defined (precomputed) features.
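
As a minimal sketch (the file and column names are assumptions), a CSV whose feature values already live in dedicated columns might be loaded together with a dc.feat.UserDefinedFeaturizer naming those columns:

    import deepchem as dc

    # Assumes a file "precomputed.csv" with an "id" column, a "label" task column,
    # and precomputed feature columns "feat_1" and "feat_2".
    featurizer = dc.feat.UserDefinedFeaturizer(["feat_1", "feat_2"])
    loader = dc.data.UserCSVLoader(tasks=["label"],
                                   id_field="id",
                                   featurizer=featurizer)
    dataset = loader.create_dataset(["precomputed.csv"])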

__init__(tasks, smiles_field=None, id_field=None, featurizer=None, log_every_n=1000)[source]

Initializes UserCSVLoader.

Parameters:
  • tasks (list[str]) – List of task names
  • smiles_field (str, optional) – Name of field that holds the SMILES string
  • id_field (str, optional) – Name of field that holds sample identifier
  • featurizer (dc.feat.Featurizer, optional) – Featurizer to use to process data
  • log_every_n (int, optional) – Writes a logging statement this often.
create_dataset(input_files, data_dir=None, shard_size=8192)[source]

Creates and returns a Dataset object by featurizing provided files.

Reads in input_files and uses self.featurizer to featurize the data in these input files. For large files, automatically shards into smaller chunks of shard_size datapoints for convenience. Returns a Dataset object that contains the featurized dataset.

This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.

Parameters:
  • input_files (list) – List of input filenames.
  • data_dir (str, optional) – Directory to store featurized dataset.
  • shard_size (int, optional) – Number of examples stored in each shard.
Returns:
  A Dataset object containing a featurized representation of the data from input_files.

featurize(input_files, data_dir=None, shard_size=8192)[source]

Featurize provided files and write to specified location.

DEPRECATED: This method is now a wrapper for create_dataset() and calls that method under the hood.

For large datasets, automatically shards into smaller chunks for convenience. This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.

Parameters:
  • input_files (list) – List of input filenames.
  • data_dir (str, optional) – Directory to store featurized dataset.
  • shard_size (int, optional) – Number of examples stored in each shard.
Returns:
  A Dataset object containing a featurized representation of the data from input_files.

FASTALoader

class deepchem.data.FASTALoader[source]

Handles loading of FASTA files.

FASTA files are commonly used to hold sequence data. This class provides convenience methods to load FASTA data and one-hot encode the genomic sequences for use in downstream learning tasks.
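
A minimal usage sketch (the file name is an assumption); since the loader is hard-wired to one-hot encoding, no featurizer is passed:

    import deepchem as dc

    # Assumes a FASTA file "sequences.fasta" in the working directory.
    loader = dc.data.FASTALoader()
    dataset = loader.create_dataset(["sequences.fasta"])  # one-hot encoded sequences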

__init__()[source]

Initialize loader.

create_dataset(input_files, data_dir=None, shard_size=None)[source]

Creates a Dataset from input FASTA files.

At present, FASTA support is limited: only one-hot featurization is available, and sharding is not supported.

Parameters:
  • input_files (list) – List of fasta files.
  • data_dir (str, optional) – Name of directory where featurized data is stored.
  • shard_size (int, optional) – For now, this argument is ignored and each FASTA file gets its own shard.
Returns:
  A Dataset object containing a featurized representation of the data from input_files.

featurize(input_files, data_dir=None, shard_size=8192)[source]

Featurize provided files and write to specified location.

DEPRECATED: This method is now a wrapper for create_dataset() and calls that method under the hood.

For large datasets, automatically shards into smaller chunks for convenience. This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.

Parameters:
  • input_files (list) – List of input filenames.
  • data_dir (str, optional) – Directory to store featurized dataset.
  • shard_size (int, optional) – Number of examples stored in each shard.
Returns:
  A Dataset object containing a featurized representation of the data from input_files.

ImageLoader

class deepchem.data.ImageLoader(tasks=None)[source]

Handles loading of image files.

This class allows for loading of images in various formats. For user convenience, it also accepts zip files and directories of images, and uses some limited intelligence to traverse any subdirectories that contain images.
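
As a minimal sketch (the archive name, image count, and label array are illustrative assumptions), a zip archive of images with one label per image could be loaded like this:

    import numpy as np
    import deepchem as dc

    # Assumes "cells.zip" contains 100 .png images, with one label per image.
    labels = np.random.randint(0, 2, size=(100,))   # made-up binary labels
    loader = dc.data.ImageLoader(tasks=["label"])
    dataset = loader.create_dataset(["cells.zip"],
                                    labels=labels,
                                    in_memory=False)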

__init__(tasks=None)[source]

Initialize image loader.

At present, custom image featurizers aren’t supported by this loader class.

Parameters:tasks (list[str]) – List of task names for image labels.
create_dataset(input_files, labels=None, weights=None, in_memory=False)[source]

Creates and returns a Dataset object by featurizing provided image files and labels/weights.

Parameters:
  • input_files (list) – Each file in this list should either be of a supported image format (.png, .tif only for now) or of a compressed folder of image files (only .zip for now).
  • labels (optional) – If provided, a numpy ndarray of image labels
  • weights (optional) – If provided, a numpy ndarray of image weights
  • in_memory (bool) – If True, returns an in-memory NumpyDataset. Otherwise, returns an ImageDataset.
Returns:
  A Dataset object containing a featurized representation of the data from input_files, labels, and weights.

featurize(input_files, data_dir=None, shard_size=8192)[source]

Featurize provided files and write to specified location.

DEPRECATED: This method is now a wrapper for create_dataset() and calls that method under the hood.

For large datasets, automatically shards into smaller chunks for convenience. This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.

Parameters:
  • input_files (list) – List of input filenames.
  • data_dir (str, optional) – Directory to store featurized dataset.
  • shard_size (int, optional) – Number of examples stored in each shard.
Returns:
  A Dataset object containing a featurized representation of the data from input_files.