Data Loaders

Building a dc.data.Dataset object from large amounts of raw input data can require a fair amount of custom code. To simplify this process, you can use the dc.data.DataLoader classes. These classes provide utilities for loading and processing large amounts of data.
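
As a quick preview, the typical workflow pairs a concrete loader with a featurizer and then calls create_dataset(). A minimal sketch, assuming a CSV file named "data.csv" with a "smiles" feature column and a "task1" label column (both the file and the column names are hypothetical):

>>> import deepchem as dc
>>> loader = dc.data.CSVLoader(tasks=["task1"], feature_field="smiles",
...                            featurizer=dc.feat.CircularFingerprint())
>>> dataset = loader.create_dataset("data.csv")  # hypothetical file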

DataLoader

class deepchem.data.DataLoader(tasks: List[str], featurizer: deepchem.feat.base_classes.Featurizer, id_field: Optional[str] = None, log_every_n: int = 1000)[source]

Handles loading/featurizing of data from disk.

The main use of DataLoader and its child classes is to make it easier to load large datasets into Dataset objects.

DataLoader is an abstract superclass that provides a general framework for loading data into DeepChem. This class should never be instantiated directly. To load your own type of data, make a subclass of DataLoader and provide your own implementation for the create_dataset() method.

To construct a Dataset from input data, first instantiate a concrete data loader (that is, an object which is an instance of a subclass of DataLoader) with a given Featurizer object. Then call the data loader’s create_dataset() method on a list of input files that hold the source data to process. Note that each subclass of DataLoader is specialized to handle one type of input data, so you will have to pick the loader class suitable for your input data type.

Note that it isn’t necessary to use a data loader to process input data. You can directly use Featurizer objects to featurize provided input into numpy arrays, but this calculation is performed entirely in memory, so you would have to write generators that walk the source files and write featurized data to disk yourself. DataLoader and its subclasses make this process easier for you by performing this work under the hood.
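
To make the subclassing contract concrete, here is a minimal sketch of a custom loader (not part of DeepChem; the class name and helper bodies are illustrative assumptions). A subclass can either override create_dataset() outright or, like the built-in CSVLoader, supply the _get_shards and _featurize_shard helpers that the default create_dataset() implementation drives:

import numpy as np
import pandas as pd
import deepchem as dc

class TwoColumnCSVLoader(dc.data.DataLoader):
    """Hypothetical loader for CSV files with a 'smiles' column."""

    def _get_shards(self, input_files, shard_size):
        # Yield one pandas DataFrame of at most shard_size rows at a time,
        # which is what the default create_dataset() implementation expects.
        for input_file in input_files:
            for shard in pd.read_csv(input_file, chunksize=shard_size):
                yield shard

    def _featurize_shard(self, shard):
        # Featurize one shard. Returning (features, valid_indicators) mirrors
        # the pattern used by the built-in CSVLoader; treat the details as an
        # assumption rather than a documented contract.
        features = self.featurizer.featurize(shard["smiles"].values)
        return features, np.ones(len(features), dtype=bool)

With these helpers in place, TwoColumnCSVLoader(["task1"], featurizer=dc.feat.CircularFingerprint()).create_dataset(["data.csv"]) would shard and featurize along the same lines as the built-in loaders.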

__init__(tasks: List[str], featurizer: deepchem.feat.base_classes.Featurizer, id_field: Optional[str] = None, log_every_n: int = 1000)[source]

Construct a DataLoader object.

This constructor is provided mainly as a template. As a user, you should never call it directly.

Parameters:
  • tasks (List[str]) – List of task names
  • featurizer (Featurizer) – Featurizer to use to process data.
  • id_field (str, optional (default None)) – Name of field that holds the sample identifier. Note that the meaning of “field” depends on the input data type and can differ between subclasses: for example, in a CSV file a field is a column, while in an SDF file a field is a molecular property.
  • log_every_n (int, optional (default 1000)) – Writes a logging statement this often.
create_dataset(inputs: Union[Any, Sequence[Any]], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192) → deepchem.data.datasets.Dataset[source]

Creates and returns a Dataset object by featurizing provided files.

Reads in inputs and uses self.featurizer to featurize the data in these inputs. For large files, automatically shards into smaller chunks of shard_size datapoints for convenience. Returns a Dataset object that contains the featurized dataset.

This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.

Parameters:
  • inputs (List) – List of inputs to process. Entries can be filenames or arbitrary objects.
  • data_dir (str, optional (default None)) – Directory to store featurized dataset.
  • shard_size (int, optional (default 8192)) – Number of examples stored in each shard.
Returns:

A DiskDataset object containing a featurized representation of data from inputs.

Return type:

DiskDataset

featurize(inputs: Union[Any, Sequence[Any]], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192) → deepchem.data.datasets.Dataset[source]

Featurize provided files and write to specified location.

DEPRECATED: This method is now a wrapper for create_dataset() and calls that method under the hood.

For large datasets, automatically shards into smaller chunks for convenience. This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.

Parameters:
  • inputs (List) – List of inputs to process. Entries can be filenames or arbitrary objects.
  • data_dir (str, optional (default None)) – Directory to store featurized dataset.
  • shard_size (int, optional (default 8192)) – Number of examples stored in each shard.
Returns:

A Dataset object containing a featurized representation of data from inputs.

Return type:

Dataset

CSVLoader

class deepchem.data.CSVLoader(tasks: List[str], featurizer: deepchem.feat.base_classes.Featurizer, feature_field: Optional[str] = None, id_field: Optional[str] = None, smiles_field: Optional[str] = None, log_every_n: int = 1000)[source]

Creates Dataset objects from input CSV files.

This class provides conveniences to load data from CSV files. It’s possible to directly featurize data from CSV files using pandas, but this class may prove useful if you’re processing large CSV files that you don’t want to manipulate directly in memory.

Examples

Let’s suppose we have some SMILES strings and labels

>>> smiles = ["C", "CCC"]
>>> labels = [1.5, 2.3]

Let’s put these in a dataframe.

>>> import pandas as pd
>>> df = pd.DataFrame(list(zip(smiles, labels)), columns=["smiles", "task1"])

Let’s now write this to disk somewhere. We can now use CSVLoader to process this CSV dataset.

>>> import tempfile
>>> import deepchem as dc
>>> with tempfile.NamedTemporaryFile(mode='w') as tmpfile:
...   df.to_csv(tmpfile.name)
...   loader = dc.data.CSVLoader(["task1"], feature_field="smiles",
...                              featurizer=dc.feat.CircularFingerprint())
...   dataset = loader.create_dataset(tmpfile.name)
>>> len(dataset)
2

Of course in practice you should already have your data in a CSV file if you’re using CSVLoader. If your data is already in memory, use InMemoryLoader instead.

__init__(tasks: List[str], featurizer: deepchem.feat.base_classes.Featurizer, feature_field: Optional[str] = None, id_field: Optional[str] = None, smiles_field: Optional[str] = None, log_every_n: int = 1000)[source]

Initializes CSVLoader.

Parameters:
  • tasks (List[str]) – List of task names
  • featurizer (Featurizer) – Featurizer to use to process data.
  • feature_field (str, optional (default None)) – Field with data to be featurized.
  • id_field (str, optional (default None)) – CSV column that holds the sample identifier.
  • smiles_field (str, optional (default None) (DEPRECATED)) – Name of field that holds the SMILES string.
  • log_every_n (int, optional (default 1000)) – Writes a logging statement this often.
create_dataset(inputs: Union[Any, Sequence[Any]], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192) → deepchem.data.datasets.Dataset[source]

Creates and returns a Dataset object by featurizing provided files.

Reads in inputs and uses self.featurizer to featurize the data in these inputs. For large files, automatically shards into smaller chunks of shard_size datapoints for convenience. Returns a Dataset object that contains the featurized dataset.

This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.

Parameters:
  • inputs (List) – List of inputs to process. Entries can be filenames or arbitrary objects.
  • data_dir (str, optional (default None)) – Directory to store featurized dataset.
  • shard_size (int, optional (default 8192)) – Number of examples stored in each shard.
Returns:

A DiskDataset object containing a featurized representation of data from inputs.

Return type:

DiskDataset

featurize(inputs: Union[Any, Sequence[Any]], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192) → deepchem.data.datasets.Dataset[source]

Featurize provided files and write to specified location.

DEPRECATED: This method is now a wrapper for create_dataset() and calls that method under the hood.

For large datasets, automatically shards into smaller chunks for convenience. This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.

Parameters:
  • inputs (List) – List of inputs to process. Entries can be filenames or arbitrary objects.
  • data_dir (str, optional (default None)) – Directory to store featurized dataset.
  • shard_size (int, optional (default 8192)) – Number of examples stored in each shard.
Returns:

A Dataset object containing a featurized representation of data from inputs.

Return type:

Dataset

UserCSVLoader

class deepchem.data.UserCSVLoader(tasks: List[str], featurizer: deepchem.feat.base_classes.Featurizer, feature_field: Optional[str] = None, id_field: Optional[str] = None, smiles_field: Optional[str] = None, log_every_n: int = 1000)[source]

Handles loading of CSV files with user-defined features.

This is a convenience class that allows descriptors already present in a CSV file to be extracted directly, without any featurization step.

Examples

Let’s suppose we have some descriptors and labels. (Imagine that these descriptors have been computed by an external program.)

>>> desc1 = [1, 43]
>>> desc2 = [-2, -22]
>>> labels = [1.5, 2.3]
>>> ids = ["cp1", "cp2"]

Let’s put these in a dataframe.

>>> import pandas as pd
>>> df = pd.DataFrame(list(zip(ids, desc1, desc2, labels)), columns=["id", "desc1", "desc2", "task1"])

Let’s now write this to disk somewhere. We can now use UserCSVLoader to process this CSV dataset.

>>> import tempfile
>>> import deepchem as dc
>>> featurizer = dc.feat.UserDefinedFeaturizer(["desc1", "desc2"])
>>> with tempfile.NamedTemporaryFile(mode='w') as tmpfile:
...   df.to_csv(tmpfile.name)
...   loader = dc.data.UserCSVLoader(["task1"], id_field="id",
...                                  featurizer=featurizer)
...   dataset = loader.create_dataset(tmpfile.name)
>>> len(dataset)
2
>>> dataset.X[0, 0]
1

The difference between UserCSVLoader and CSVLoader is that here the descriptors (the features) have already been computed for us and are simply spread across multiple columns of the CSV file.

Of course in practice you should already have your data in a CSV file if you’re using UserCSVLoader. If your data is already in memory, use InMemoryLoader instead.

__init__(tasks: List[str], featurizer: deepchem.feat.base_classes.Featurizer, feature_field: Optional[str] = None, id_field: Optional[str] = None, smiles_field: Optional[str] = None, log_every_n: int = 1000)[source]

Initializes UserCSVLoader.

Parameters:
  • tasks (List[str]) – List of task names
  • featurizer (Featurizer) – Featurizer to use to process data.
  • feature_field (str, optional (default None)) – Field with data to be featurized.
  • id_field (str, optional (default None)) – CSV column that holds the sample identifier.
  • smiles_field (str, optional (default None) (DEPRECATED)) – Name of field that holds the SMILES string.
  • log_every_n (int, optional (default 1000)) – Writes a logging statement this often.
create_dataset(inputs: Union[Any, Sequence[Any]], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192) → deepchem.data.datasets.Dataset[source]

Creates and returns a Dataset object by featurizing provided files.

Reads in inputs and uses self.featurizer to featurize the data in these inputs. For large files, automatically shards into smaller chunks of shard_size datapoints for convenience. Returns a Dataset object that contains the featurized dataset.

This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.

Parameters:
  • inputs (List) – List of inputs to process. Entries can be filenames or arbitrary objects.
  • data_dir (str, optional (default None)) – Directory to store featurized dataset.
  • shard_size (int, optional (default 8192)) – Number of examples stored in each shard.
Returns:

A DiskDataset object containing a featurized representation of data from inputs.

Return type:

DiskDataset

featurize(inputs: Union[Any, Sequence[Any]], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192) → deepchem.data.datasets.Dataset[source]

Featurize provided files and write to specified location.

DEPRECATED: This method is now a wrapper for create_dataset() and calls that method under the hood.

For large datasets, automatically shards into smaller chunks for convenience. This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.

Parameters:
  • inputs (List) – List of inputs to process. Entries can be filenames or arbitrary objects.
  • data_dir (str, optional (default None)) – Directory to store featurized dataset.
  • shard_size (int, optional (default 8192)) – Number of examples stored in each shard.
Returns:

A Dataset object containing a featurized representation of data from inputs.

Return type:

Dataset

JsonLoader

JSON is a flexible file format that is human-readable, lightweight, and more compact than other open standard formats like XML. JSON files are similar to Python dictionaries of key-value pairs. All keys must be strings, but values may be strings, numbers, objects, arrays, booleans, or null, so the format is more flexible than CSV. JSON is used for describing structured data and for serializing objects. It is conveniently used to read/write pandas dataframes with the pandas.read_json and pandas.DataFrame.to_json methods.

class deepchem.data.JsonLoader(tasks: List[str], feature_field: str, featurizer: deepchem.feat.base_classes.Featurizer, label_field: Optional[str] = None, weight_field: Optional[str] = None, id_field: Optional[str] = None, log_every_n: int = 1000)[source]

Creates Dataset objects from input JSON files.

This class provides conveniences to load data from JSON files. It’s possible to directly featurize data from JSON files using pandas, but this class may prove useful if you’re processing large JSON files that you don’t want to manipulate directly in memory.

It is meant to load JSON files formatted as “records” in line-delimited format, which allows for sharding: each line holds one record, like [{column -> value}, ..., {column -> value}].
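
For instance, a two-record file in this format would look like the following (the column names and values are illustrative):

{"sample_data": "C", "sample_name": "mol1", "weight": 1.0, "task": 0.5}
{"sample_data": "CCC", "sample_name": "mol2", "weight": 1.0, "task": 1.5}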

Examples

>>> import pandas as pd
>>> import deepchem as dc
>>> df = pd.DataFrame(some_data)
>>> df.columns.tolist()
['sample_data', 'sample_name', 'weight', 'task']
>>> df.to_json('file.json', orient='records', lines=True)
>>> loader = dc.data.JsonLoader(tasks=['task'], feature_field='sample_data',
...                             featurizer=dc.feat.CircularFingerprint(),
...                             label_field='task', weight_field='weight',
...                             id_field='sample_name')
>>> dataset = loader.create_dataset('file.json')
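
A fully self-contained variant of the sketch above (the column names, example values, and the choice of CircularFingerprint featurizer are illustrative assumptions, not requirements):

>>> import pandas as pd
>>> import tempfile
>>> import deepchem as dc
>>> df = pd.DataFrame({"sample_data": ["C", "CCC"], "sample_name": ["mol1", "mol2"],
...                    "weight": [1.0, 1.0], "task": [0.5, 1.5]})
>>> with tempfile.NamedTemporaryFile(mode='w', suffix='.json') as tmpfile:
...   df.to_json(tmpfile.name, orient='records', lines=True)
...   loader = dc.data.JsonLoader(tasks=["task"], feature_field="sample_data",
...                               featurizer=dc.feat.CircularFingerprint(),
...                               label_field="task", weight_field="weight",
...                               id_field="sample_name")
...   dataset = loader.create_dataset(tmpfile.name)
>>> len(dataset)
2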

__init__(tasks: List[str], feature_field: str, featurizer: deepchem.feat.base_classes.Featurizer, label_field: Optional[str] = None, weight_field: Optional[str] = None, id_field: Optional[str] = None, log_every_n: int = 1000)[source]

Initializes JsonLoader.

Parameters:
  • tasks (List[str]) – List of task names
  • feature_field (str) – JSON field with data to be featurized.
  • featurizer (Featurizer) – Featurizer to use to process data
  • label_field (str, optional (default None)) – Field with target variables.
  • weight_field (str, optional (default None)) – Field with weights.
  • id_field (str, optional (default None)) – Field for identifying samples.
  • log_every_n (int, optional (default 1000)) – Writes a logging statement this often.
create_dataset(input_files: Union[str, Sequence[str]], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192) → deepchem.data.datasets.DiskDataset[source]

Creates a Dataset from input JSON files.

Parameters:
  • input_files (OneOrMany[str]) – List of JSON filenames.
  • data_dir (str, optional (default None)) – Name of directory where featurized data is stored.
  • shard_size (int, optional (default 8192)) – Shard size when loading data.
Returns:

A DiskDataset object containing a featurized representation of data from input_files.

Return type:

DiskDataset

featurize(inputs: Union[Any, Sequence[Any]], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192) → deepchem.data.datasets.Dataset[source]

Featurize provided files and write to specified location.

DEPRECATED: This method is now a wrapper for create_dataset() and calls that method under the hood.

For large datasets, automatically shards into smaller chunks for convenience. This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.

Parameters:
  • inputs (List) – List of inputs to process. Entries can be filenames or arbitrary objects.
  • data_dir (str, optional (default None)) – Directory to store featurized dataset.
  • shard_size (int, optional (default 8192)) – Number of examples stored in each shard.
Returns:

A Dataset object containing a featurized representation of data from inputs.

Return type:

Dataset

FASTALoader

class deepchem.data.FASTALoader[source]

Handles loading of FASTA files.

FASTA files are commonly used to hold sequence data. This class provides convenience methods to load FASTA data and one-hot encode the genomic sequences for use in downstream learning tasks.
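
Examples

A minimal usage sketch; the FASTA file name here is hypothetical:

>>> import deepchem as dc
>>> loader = dc.data.FASTALoader()
>>> dataset = loader.create_dataset("sequences.fasta")  # hypothetical file

Each sequence in the file is one-hot encoded, and the resulting dataset can be fed to downstream models like any other Dataset.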

__init__()[source]

Initialize loader.

create_dataset(input_files: Union[str, Sequence[str]], data_dir: Optional[str] = None, shard_size: Optional[int] = None) → deepchem.data.datasets.DiskDataset[source]

Creates a Dataset from input FASTA files.

At present, FASTA support is limited: only one-hot featurization is available, and sharding is not supported.

Parameters:
  • input_files (List[str]) – List of fasta files.
  • data_dir (str, optional (default None)) – Name of directory where featurized data is stored.
  • shard_size (int, optional (default None)) – For now, this argument is ignored and each FASTA file gets its own shard.
Returns:

A DiskDataset object containing a featurized representation of data from input_files.

Return type:

DiskDataset

featurize(inputs: Union[Any, Sequence[Any]], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192) → deepchem.data.datasets.Dataset[source]

Featurize provided files and write to specified location.

DEPRECATED: This method is now a wrapper for create_dataset() and calls that method under the hood.

For large datasets, automatically shards into smaller chunks for convenience. This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.

Parameters:
  • inputs (List) – List of inputs to process. Entries can be filenames or arbitrary objects.
  • data_dir (str, optional (default None)) – Directory to store featurized dataset.
  • shard_size (int, optional (default 8192)) – Number of examples stored in each shard.
Returns:

A Dataset object containing a featurized representation of data from inputs.

Return type:

Dataset

ImageLoader

class deepchem.data.ImageLoader(tasks: Optional[List[str]] = None)[source]

Handles loading of image files.

This class allows for loading of images in various formats. For user convenience, it also accepts zip files and directories of images, and uses some limited heuristics to traverse subdirectories that contain images.
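
Examples

A minimal usage sketch; the image file names and labels here are hypothetical:

>>> import deepchem as dc
>>> loader = dc.data.ImageLoader(tasks=["label"])
>>> dataset = loader.create_dataset((["cell1.png", "cell2.png"], [0, 1]))  # hypothetical files

Passing a (filenames, labels) tuple attaches one label per image; see create_dataset() below for the other accepted input forms.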

__init__(tasks: Optional[List[str]] = None)[source]

Initialize image loader.

At present, custom image featurizers aren’t supported by this loader class.

Parameters:
  • tasks (List[str], optional (default None)) – List of task names for image labels.
create_dataset(inputs: Union[str, Sequence[str], Tuple[Any]], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192, in_memory: bool = False) → deepchem.data.datasets.Dataset[source]

Creates and returns a Dataset object by featurizing provided image files and labels/weights.

Parameters:
  • inputs (Union[OneOrMany[str], Tuple[Any]]) –

    The inputs provided should be one of the following

    • filename
    • list of filenames
    • Tuple (list of filenames, labels)
    • Tuple (list of filenames, labels, weights)

    Each file in a given list of filenames should either be in a supported image format (.png and .tif only for now) or be a compressed folder of image files (only .zip for now). If labels or weights are provided, they must correspond to the sorted order of all filenames provided, with one label/weight per file.

  • data_dir (str, optional (default None)) – Directory to store featurized dataset.
  • shard_size (int, optional (default 8192)) – Shard size when loading data.
  • in_memory (bool, optional (default False)) – If True, returns an in-memory NumpyDataset. Otherwise returns an ImageDataset.
Returns:

A Dataset object containing a featurized representation of data from input_files, labels, and weights.

Return type:

Dataset

featurize(inputs: Union[Any, Sequence[Any]], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192) → deepchem.data.datasets.Dataset[source]

Featurize provided files and write to specified location.

DEPRECATED: This method is now a wrapper for create_dataset() and calls that method under the hood.

For large datasets, automatically shards into smaller chunks for convenience. This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.

Parameters:
  • inputs (List) – List of inputs to process. Entries can be filenames or arbitrary objects.
  • data_dir (str, optional (default None)) – Directory to store featurized dataset.
  • shard_size (int, optional (default 8192)) – Number of examples stored in each shard.
Returns:

A Dataset object containing a featurized representation of data from inputs.

Return type:

Dataset

SDFLoader

class deepchem.data.SDFLoader(tasks: List[str], featurizer: deepchem.feat.base_classes.Featurizer, sanitize: bool = False, log_every_n: int = 1000)[source]

Creates a Dataset object from SDF input files.

This class provides conveniences to load and featurize data from Structure Data Files (SDFs). SDF is a standard format for structural information (3D coordinates of atoms and bonds) of molecular compounds.

Examples

>>> import deepchem as dc
>>> import os
>>> current_dir = os.path.dirname(os.path.realpath(__file__))
>>> featurizer = dc.feat.CircularFingerprint(size=16)
>>> loader = dc.data.SDFLoader(["LogP(RRCK)"], featurizer=featurizer, sanitize=True)
>>> dataset = loader.create_dataset(os.path.join(current_dir, "tests", "membrane_permeability.sdf")) # doctest:+ELLIPSIS
>>> len(dataset)
2

__init__(tasks: List[str], featurizer: deepchem.feat.base_classes.Featurizer, sanitize: bool = False, log_every_n: int = 1000)[source]

Initialize SDF Loader

Parameters:
  • tasks (List[str]) – List of task names. These will be loaded from the SDF file.
  • featurizer (Featurizer) – Featurizer to use to process data
  • sanitize (bool, optional (default False)) – Whether to sanitize molecules.
  • log_every_n (int, optional (default 1000)) – Writes a logging statement this often.
create_dataset(inputs: Union[Any, Sequence[Any]], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192) → deepchem.data.datasets.Dataset[source]

Creates and returns a Dataset object by featurizing provided files.

Reads in inputs and uses self.featurizer to featurize the data in these inputs. For large files, automatically shards into smaller chunks of shard_size datapoints for convenience. Returns a Dataset object that contains the featurized dataset.

This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.

Parameters:
  • inputs (List) – List of inputs to process. Entries can be filenames or arbitrary objects.
  • data_dir (str, optional (default None)) – Directory to store featurized dataset.
  • shard_size (int, optional (default 8192)) – Number of examples stored in each shard.
Returns:

A DiskDataset object containing a featurized representation of data from inputs.

Return type:

DiskDataset

featurize(inputs: Union[Any, Sequence[Any]], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192) → deepchem.data.datasets.Dataset[source]

Featurize provided files and write to specified location.

DEPRECATED: This method is now a wrapper for create_dataset() and calls that method under the hood.

For large datasets, automatically shards into smaller chunks for convenience. This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.

Parameters:
  • inputs (List) – List of inputs to process. Entries can be filenames or arbitrary objects.
  • data_dir (str, optional (default None)) – Directory to store featurized dataset.
  • shard_size (int, optional (default 8192)) – Number of examples stored in each shard.
Returns:

A Dataset object containing a featurized representation of data from inputs.

Return type:

Dataset

InMemoryLoader

The dc.data.InMemoryLoader is designed to facilitate the processing of large datasets when you already hold the raw data in memory (say, in a pandas dataframe).

class deepchem.data.InMemoryLoader(tasks: List[str], featurizer: deepchem.feat.base_classes.Featurizer, id_field: Optional[str] = None, log_every_n: int = 1000)[source]

Facilitates featurization of in-memory objects.

When featurizing a dataset, it’s often the case that the initial set of data (pre-featurization) fits handily within memory. (For example, perhaps it fits within a column of a pandas DataFrame.) In this case, it would be convenient to be able to featurize this column of data directly. However, the process of featurization often generates large arrays which quickly eat up available memory. This class provides convenient capabilities to process such in-memory data by periodically checkpointing generated features to disk.

Example

Here’s an example with only datapoints and no labels or weights.

>>> import deepchem as dc
>>> smiles = ["C", "CC", "CCC", "CCCC"]
>>> featurizer = dc.feat.CircularFingerprint()
>>> loader = dc.data.InMemoryLoader(tasks=["task1"], featurizer=featurizer)
>>> dataset = loader.create_dataset(smiles, shard_size=2)
>>> len(dataset)
4

Here’s an example with both datapoints and labels

>>> import deepchem as dc
>>> smiles = ["C", "CC", "CCC", "CCCC"]
>>> labels = [1, 0, 1, 0]
>>> featurizer = dc.feat.CircularFingerprint()
>>> loader = dc.data.InMemoryLoader(tasks=["task1"], featurizer=featurizer)
>>> dataset = loader.create_dataset(zip(smiles, labels), shard_size=2)
>>> len(dataset)
4

Here’s an example with datapoints, labels, weights and ids all provided.

>>> import deepchem as dc
>>> smiles = ["C", "CC", "CCC", "CCCC"]
>>> labels = [1, 0, 1, 0]
>>> weights = [1.5, 0, 1.5, 0]
>>> ids = ["C", "CC", "CCC", "CCCC"]
>>> featurizer = dc.feat.CircularFingerprint()
>>> loader = dc.data.InMemoryLoader(tasks=["task1"], featurizer=featurizer)
>>> dataset = loader.create_dataset(zip(smiles, labels, weights, ids), shard_size=2)
>>> len(dataset)
4

__init__(tasks: List[str], featurizer: deepchem.feat.base_classes.Featurizer, id_field: Optional[str] = None, log_every_n: int = 1000)[source]

Construct an InMemoryLoader object.

Unlike the abstract DataLoader superclass, InMemoryLoader is concrete, so this constructor can be called directly (as the examples above do).

Parameters:
  • tasks (List[str]) – List of task names
  • featurizer (Featurizer) – Featurizer to use to process data.
  • id_field (str, optional (default None)) – Name of field that holds the sample identifier. Note that the meaning of “field” depends on the input data type and can differ between subclasses: for example, in a CSV file a field is a column, while in an SDF file a field is a molecular property.
  • log_every_n (int, optional (default 1000)) – Writes a logging statement this often.
create_dataset(inputs: Sequence[Any], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192) → deepchem.data.datasets.DiskDataset[source]

Creates and returns a Dataset object by featurizing the provided inputs.

Reads in inputs and uses self.featurizer to featurize the data they contain. For large input sequences, automatically shards into smaller chunks of shard_size datapoints for convenience. Returns a Dataset object that contains the featurized dataset.

This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.

Parameters:
  • inputs (Sequence[Any]) – List of inputs to process. Entries can be arbitrary objects so long as they are understood by self.featurizer.
  • data_dir (str, optional (default None)) – Directory to store featurized dataset.
  • shard_size (int, optional (default 8192)) – Number of examples stored in each shard.
Returns:

A DiskDataset object containing a featurized representation of data from inputs.

Return type:

DiskDataset

featurize(inputs: Union[Any, Sequence[Any]], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192) → deepchem.data.datasets.Dataset[source]

Featurize provided files and write to specified location.

DEPRECATED: This method is now a wrapper for create_dataset() and calls that method under the hood.

For large datasets, automatically shards into smaller chunks for convenience. This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.

Parameters:
  • inputs (List) – List of inputs to process. Entries can be filenames or arbitrary objects.
  • data_dir (str, optional (default None)) – Directory to store featurized dataset.
  • shard_size (int, optional (default 8192)) – Number of examples stored in each shard.
Returns:

A Dataset object containing a featurized representation of data from inputs.

Return type:

Dataset