
The DeepChem library is packaged alongside the MoleculeNet suite of datasets. One of the most important parts of machine learning applications is finding a suitable dataset. The MoleculeNet suite has curated a whole range of datasets and loaded them into DeepChem objects for convenience.

MoleculeNet Cheatsheet

When training a model or running a benchmark, users need specific datasets. At the outset, however, the search can be exhausting and confusing. The following cheatsheet is meant to help DeepChem users identify more easily which dataset suits their purpose.

Each entry represents a dataset and gives a brief description of it. The entries also indicate the type of data (molecular properties, images, or materials) and how many data points the dataset has. Each dataset is referenced with a link to its paper. Finally, some entries still need further information.


Entries list the MoleculeNet dataset, its description, and (where given) the data type and the number of data points.

  • BACE (Regression) – Provides binding results for a set of inhibitors of human beta-secretase (BACE-1).

  • BACE (Classification) – Provides binding results for a set of inhibitors of human beta-secretase (BACE-1).

  • Images of HT29 colon cancer cells.

  • Images of Drosophila Kc167 cells.

  • DIC images of mouse embryos.

  • Synthetic images of clustered nuclei.

  • Synthetic images of clustered nuclei.

  • Blood-Brain Barrier Penetration, designed for the modeling and prediction of barrier permeability. Type: binary labels on permeability properties.

  • Cell Counting – Synthetic emulations of fluorescence microscopic images of bacterial cells.

  • ChEMBL (set = ‘sparse’) – A sparse subset of ChEMBL with activity data for one target. Data points: 244 245.

  • ChEMBL (set = ‘5thresh’) – A subset of ChEMBL with activity data for at least five targets. Data points: 23 871.

  • Compares drugs approved by the FDA and drugs that have failed clinical trials for toxicity reasons.

  • A regression dataset containing structures and water solubility data.

  • Merck in-house compounds that were measured for IC50 of inhibition on 12 serine proteases.

  • A collection of experimental and calculated hydration free energies for small molecules in water.

  • A dataset of compounds tested for their ability to inhibit HIV replication. Data points: 40 000.

  • Harvard Organic Photovoltaic dataset utilized as p-type materials.

  • Thermodynamic solubility datasets.

  • In-house compounds that were measured on 15 enzyme inhibition and ADME/TOX datasets. Data points: 100 000.

  • In-house compounds that were measured for IC50 of inhibition on 99 protein kinases. Data points: 2 500.

  • Experimental results of octanol/water distribution coefficient (logD at pH 7.4). Data points: 4 200.

  • Band Gap – Experimentally measured band gaps for inorganic crystal structures. Data points: 4 604.

  • Contains Perovskite structures and their formation energies. Data points: 18 928.

  • MP Formation Energy – Contains calculated formation energies and inorganic crystal structures from the Materials Project database. Data points: 132 752.

  • MP Metallicity – Contains inorganic crystal structures from the Materials Project database labeled as metals or nonmetals. Data points: 106 113.

  • Benchmark dataset selected from PubChem BioAssay by applying a refined nearest neighbor analysis. Data points: 90 000.

  • Database consisting of biological activities of small molecules generated by high-throughput screening. Data points: 400 000.

  • Experimental binding affinity data and structures of protein-ligand complexes. Data points: “refined set” 4 852 - “general set” 12 800 - “core set” 193.

  • Subset of GDB-13 containing up to 7 heavy atoms CNOS. Data points: 7 165.

  • Dataset used in a study on modeling quantum mechanical calculations of electronic spectra and excited state energy of small molecules. Data points: 20 000.

  • Dataset that provides geometric/energetic/electronic and thermodynamic properties for a subset of GDB-17 database. Data points: 134 000.

  • Similar to the FreeSolv dataset, which provides experimental and calculated hydration free energy of small molecules in water.

  • The Side Effect Resource (SIDER) is a database of marketed drugs and adverse drug reactions (ADR). Data points: 1 427.

  • Thermodynamic solubility datasets.

  • The “Toxicology in the 21st Century” (Tox21) initiative created a public database measuring the toxicity of compounds. Data points: 8 000.

  • Toxicology data for an extensive library of compounds based on in vitro high-throughput screening. Data points: 8 000.

  • Subsets of USPTO dataset of organic chemical reactions extracted from US patents and patent applications. Type: chemical reactions SMILES. Data points: MIT 479 000 - STEREO 1 M - 50K 50 000.

  • The UV dataset tests Merck’s internal compounds on 190 absorption wavelengths between 210 and 400 nm. Data points: 10 000.

  • Purchasable compounds for virtual screening of small molecules to identify structures that are likely to bind to drug targets. Data points: 250K - 1M - 10M.

  • Platinum Adsorption – Different configurations of adsorbates (i.e., N and NO) on a platinum surface represented as a lattice, and their formation energy. Type: adsorbate configurations.


Contributing a new dataset to MoleculeNet

If you are proposing a new dataset to be included in the MoleculeNet benchmarking suite, please follow the instructions below. Please review the datasets already available in MolNet before contributing.

  1. Read the Contribution guidelines.

  2. Open an issue to discuss the dataset you want to add to MolNet.

  3. Write a DatasetLoader class that inherits from deepchem.molnet.load_function.molnet_loader._MolnetLoader and implements a create_dataset method. See the _QM9Loader for a simple example.

  4. Write a load_dataset function that documents the dataset, and add your load function to the deepchem.molnet namespace for easy importing.

  5. Prepare your dataset as a .tar.gz or .zip file. Accepted filetypes include CSV, JSON, and SDF.

  6. Ask a member of the technical steering committee to add your .tar.gz or .zip file to the DeepChem AWS bucket. Modify your load function to pull down the dataset from AWS.

  7. Add documentation for your loader to the MoleculeNet docs.

  8. Submit a [WIP] PR (Work in progress pull request) following the PR template.
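Step 5 above can be sketched with the standard library alone. The snippet writes a small CSV and packs it into a .tar.gz archive. The file names, column names and rows are hypothetical placeholders, not a real MoleculeNet dataset.

```python
import csv
import os
import tarfile
import tempfile

# Hypothetical rows standing in for a real dataset (SMILES string plus one label).
rows = [
    {"smiles": "CCO", "label": "0.5"},
    {"smiles": "c1ccccc1", "label": "1.2"},
]

work_dir = tempfile.mkdtemp()
csv_path = os.path.join(work_dir, "my_dataset.csv")  # placeholder file name

# Write the raw data as CSV, one of the accepted file types.
with open(csv_path, "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=["smiles", "label"])
    writer.writeheader()
    writer.writerows(rows)

# Pack the CSV into a gzip-compressed tarball, storing only the base file name.
archive_path = os.path.join(work_dir, "my_dataset.tar.gz")
with tarfile.open(archive_path, "w:gz") as archive:
    archive.add(csv_path, arcname="my_dataset.csv")

# Confirm what the archive contains before asking for it to be uploaded.
with tarfile.open(archive_path, "r:gz") as archive:
    members = archive.getnames()
print(members)
```

Storing the file under its base name keeps local temporary paths out of the archive.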

Example Usage

Below is an example of how to load a MoleculeNet dataset and featurizer. This approach will work for any dataset in MoleculeNet by changing the load function and featurizer. For more details on the featurizers, see the Featurizers section.

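import deepchem as dc
from deepchem.feat.molecule_featurizers import MolGraphConvFeaturizer

# Featurize each molecule as a graph, including edge features
featurizer = MolGraphConvFeaturizer(use_edges=True)
# The loader returns the task names, the split datasets and the transformers
dataset_dc = dc.molnet.load_qm9(featurizer=featurizer)
tasks, dataset, transformers = dataset_dc
train, valid, test = dataset

# Features, labels, sample weights and sample identifiers of the training set
x, y, w, ids = train.X, train.y, train.w, train.ids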

Note that the “w” matrix represents the weight of each sample. Some assays may have missing values, in which case the weight is 0. Otherwise, the weight is 1.
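To illustrate how such weights are typically used, here is a minimal sketch of a weighted mean squared error in which entries with weight 0 do not contribute. The function name and the values are hypothetical, and this is not DeepChem's own loss implementation.

```python
def weighted_mse(y_true, y_pred, w):
    """Mean squared error that ignores entries whose weight is 0."""
    total_weight = sum(w)
    if total_weight == 0:
        raise ValueError("all weights are zero; nothing to average")
    squared = [wi * (yt - yp) ** 2 for yt, yp, wi in zip(y_true, y_pred, w)]
    return sum(squared) / total_weight


# Hypothetical single-task values. The second label is missing, so its weight
# is 0 and the placeholder 0.0 stored in y_true never reaches the loss.
y_true = [1.0, 0.0, 3.0]
y_pred = [1.5, 9.9, 2.0]
w = [1.0, 0.0, 1.0]

loss = weighted_mse(y_true, y_pred, w)
print(loss)
```

Dividing by the sum of the weights rather than the number of samples keeps the average from being diluted by missing measurements.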

Additionally, the environment variable DEEPCHEM_DATA_DIR can be set, for example with os.environ['DEEPCHEM_DATA_DIR'] = 'path/to/store/featurized/dataset'. When it is set, the MoleculeNet loader stores the featurized dataset in the specified directory. The next time the dataset is loaded, it is fetched from that directory directly instead of being featurized from the raw data again.
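A minimal sketch of setting the variable from Python before any loader is called. The directory name is a hypothetical placeholder, and any writable location should work.

```python
import os
import tempfile

# Hypothetical cache location; any writable directory works.
cache_dir = os.path.join(tempfile.gettempdir(), "deepchem_cache")
os.makedirs(cache_dir, exist_ok=True)

# Environment values must be strings, so the path is built as a str.
os.environ["DEEPCHEM_DATA_DIR"] = cache_dir

print(os.environ["DEEPCHEM_DATA_DIR"])
```

Setting the variable this way affects only the current Python process and its children.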

BACE Dataset

load_bace_classification(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['balancing'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load BACE dataset with classification labels.

The BACE dataset with classification labels (“class”) contains 1513 compounds, each with a binary label of 0 or 1.

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in
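The behaviour described for reload can be pictured as a cache keyed by featurizer and splitter. The sketch below is a simplified stand-in, not DeepChem's actual caching code; load_with_cache, the file naming scheme and the toy dataset are all hypothetical.

```python
import os
import pickle
import tempfile


def load_with_cache(cache_root, featurizer_name, splitter_name, build):
    """Return (dataset, from_cache); build() runs only when no cached copy exists."""
    # One cache file per (featurizer, splitter) combination.
    key = f"{featurizer_name}__{splitter_name}.pkl"
    path = os.path.join(cache_root, key)
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle), True
    dataset = build()
    with open(path, "wb") as handle:
        pickle.dump(dataset, handle)
    return dataset, False


def build():
    # Stand-in for the expensive featurization and splitting step.
    return {"X": [[0, 1], [1, 0]], "y": [0, 1]}


cache_root = tempfile.mkdtemp()
first, first_cached = load_with_cache(cache_root, "ECFP", "scaffold", build)
second, second_cached = load_with_cache(cache_root, "ECFP", "scaffold", build)
third, third_cached = load_with_cache(cache_root, "ECFP", "random", build)
print(first_cached, second_cached, third_cached)
```

Only the second call hits the cache, because the third one uses a different splitter and therefore a different key.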

load_bace_regression(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load BACE dataset, regression labels

The BACE dataset provides quantitative IC50 and qualitative (binary label) binding results for a set of inhibitors of human beta-secretase 1 (BACE-1).

All data are experimental values reported in scientific literature over the past decade, some with detailed crystal structures available. A collection of 1522 compounds is provided, along with the regression labels of IC50. The number of tasks in the dataset is one.

Scaffold splitting is recommended for this dataset.

The raw data CSV file contains the columns below:

  • “mol” - SMILES representation of the molecular structure

  • “pIC50” - Negative log of the IC50 binding affinity

  • “class” - Binary labels for inhibitor

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in
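The “pIC50” column listed above is a negative logarithm of IC50. As a worked example, the sketch below converts IC50 values to pIC50, assuming a base-10 logarithm of the concentration in mol/L and inputs given in nanomolar. Both are common conventions but are assumptions here, since the units are not stated above.

```python
import math


def ic50_nm_to_pic50(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    # Assumption: input is in nanomolar and pIC50 is defined on mol/L.
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_nm * 1e-9)


# Hypothetical IC50 values in nanomolar.
values_nm = [1.0, 100.0, 10000.0]
pic50 = [ic50_nm_to_pic50(value) for value in values_nm]
print([round(p, 3) for p in pic50])
```

Under this convention a more potent inhibitor (lower IC50) has a higher pIC50.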


BBBC Datasets

load_bbbc001(splitter: Splitter | str | None = 'index', transformers: List[TransformerGenerator | str] = [], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load BBBC001 dataset

This dataset contains 6 images of human HT29 colon cancer cells. The task is to learn to predict the cell counts in these images. The dataset is too small to train algorithms on, but may serve as a good test dataset.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

load_bbbc002(splitter: Splitter | str | None = 'index', transformers: List[TransformerGenerator | str] = [], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load BBBC002 dataset

This dataset contains data corresponding to 5 samples of Drosophila Kc167 cells. There are 10 fields of view for each sample, each an image of size 512x512. Ground truth labels contain cell counts for this dataset. Full details about this dataset are present at

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

load_bbbc003(load_segmentation_mask: bool = False, splitter: Splitter | str | None = 'index', transformers: List[TransformerGenerator | str] = [], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load BBBC003 dataset

This dataset contains data corresponding to 15 samples of Mouse embryos with DIC. Each image is of size 640x480. Ground truth labels contain cell counts and segmentation masks for this dataset. Full details about this dataset are present at

  • load_segmentation_mask (bool) – if True, the dataset will contain segmentation masks as labels. Otherwise, the dataset will contain cell counts as labels.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in


Importing necessary modules

>>> import deepchem as dc
>>> import numpy as np

We can load the BBBC003 dataset with 2 types of labels: segmentation masks and cell counts. We will first load the dataset with cell counts as labels.

>>> loader = dc.molnet.load_bbbc003(load_segmentation_mask=False)
>>> tasks, dataset, transformers = loader
>>> train, val, test = dataset

We now have a dataset with 15 samples, each with 300 cells. The images are of size 640x480. The labels are cell counts. We can verify this as follows:

>>> train.X.shape
>>> train.y.shape

We will now load the dataset with segmentation masks as labels.

>>> loader = dc.molnet.load_bbbc003(load_segmentation_mask=True)
>>> tasks, dataset, transformers = loader
>>> train, val, test = dataset

We now have a dataset with 15 samples, each with 300 cells. The images are of size 640x480. The labels are segmentation masks. We can verify this as follows:

>>> print(train.X.shape)
>>> print(train.y.shape)

Note: The image labelled ‘7_19_M2E15.tif’ is transposed to 480x640 in the source file, along with its segmentation mask. To match it with the other images, we need to transpose it back to 640x480.

This image is found at index 6 in the train dataset (assuming no shuffling has taken place).

First, we load the dataset as usual and split it into X, y, w and ids. Here, X is the list of input images, y is the list of labels, w is the list of weights and ids is the list of IDs for each sample.

>>> train_x, train_y, train_w, train_ids = train.X, train.y, train.w, train.ids

We can now transpose the image at index 6 in the input data (train_x):

>>> train_x[6] = train_x[6].T

We can now verify that the image is of size 640x480:

>>> print(train_x[6].shape)
(640, 480)

This is also seen in the segmentation mask with the same filename and index, in which case we transpose the label (train_y) instead of the input data:

>>> train_y[6] = train_y[6].T

We can now verify that the mask is of size 640x480:

>>> train_y[6].shape
(640, 480)
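The fix above relies on knowing the index of the affected image. A minimal sketch of a shape-based alternative follows, using small NumPy arrays as hypothetical stand-ins for the images. It assumes 2D arrays and that the only mismatch is a swapped pair of axes.

```python
import numpy as np

# Hypothetical stand-ins for images: most are 640x480, one is stored as 480x640.
images = [np.zeros((640, 480)) for _ in range(3)]
images[1] = np.zeros((480, 640))
expected_shape = (640, 480)

fixed = []
for image in images:
    if image.shape == expected_shape[::-1]:
        # Swap the two axes so the array matches the rest of the dataset.
        image = image.T
    fixed.append(image)

# With uniform shapes the list can be stacked into a single array.
stacked = np.stack(fixed)
print(stacked.shape)
```

Comparing shapes avoids depending on sample order, which may change if the data is shuffled or split differently.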

load_bbbc004(overlap_probability: float = 0.0, load_segmentation_mask: bool = False, splitter: Splitter | str | None = 'index', transformers: List[TransformerGenerator | str] = [], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load BBBC004 dataset

This dataset contains data corresponding to 20 samples of synthetically generated fluorescent cell population images. There are 300 cells in each sample, each an image of size 950x950. Ground truth labels contain cell counts and segmentation masks for this dataset. Full details about this dataset are present at

  • overlap_probability (float from list {0.0, 0.15, 0.3, 0.45, 0.6}) – the overlap probability of the synthetic cells in the images

  • load_segmentation_mask (bool) – if True, the dataset will contain segmentation masks as labels. Otherwise, the dataset will contain cell counts as labels.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in


Importing necessary modules

>>> import deepchem as dc
>>> import numpy as np

We can load the BBBC004 dataset with 2 types of labels: segmentation masks and cell counts. We will first load the dataset with cell counts as labels.

>>> loader = dc.molnet.load_bbbc004(overlap_probability=0.0, load_segmentation_mask=False)
>>> tasks, dataset, transformers = loader
>>> train, val, test = dataset

We now have a dataset with 20 samples, each with 300 cells. The images are of size 950x950. The labels are cell counts. We can verify this as follows:

>>> train.X.shape
(16, 950, 950)
>>> train.y.shape

We will now load the dataset with segmentation masks as labels.

>>> loader = dc.molnet.load_bbbc004(overlap_probability=0.0, load_segmentation_mask=True)
>>> tasks, dataset, transformers = loader
>>> train, val, test = dataset

We now have a dataset with 20 samples, each with 300 cells. The images are of size 950x950. The labels are segmentation masks. We can verify this as follows:

>>> train.X.shape
(16, 950, 950)
>>> train.y.shape
(16, 950, 950, 3)

load_bbbc005(splitter: Splitter | str | None = 'index', transformers: List[TransformerGenerator | str] = [], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load BBBC005 dataset

This dataset contains data corresponding to 19,200 samples of synthetically generated fluorescent cell population images. These images were simulated for a given cell count with a clustering probability of 25% and a CCD noise variance of 0.0001. Focus blur was simulated by applying varying Gaussian filters to the images. Each image is of size 520x696. Ground truth labels contain cell counts for this dataset. Full details about this dataset are present at

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in


Importing necessary modules

>>> import deepchem as dc
>>> import numpy as np

We will now load the BBBC005 dataset with cell counts as labels.

>>> loader = dc.molnet.load_bbbc005()
>>> tasks, dataset, transformers = loader
>>> train, val, test = dataset

We now have a dataset with a total of 19,200 samples with cell counts in the range of 1-100. The images are of size 520x696. The labels are cell counts. We have a train-val-test split of 80:10:10. We can verify this as follows:

>>> train.X.shape
(15360, 520, 696)
>>> train.y.shape
(15360,)
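The split sizes quoted above follow from simple arithmetic. The sketch below shows an in-order 80:10:10 split. It illustrates the idea of an index-based split under stated assumptions and is not DeepChem's own splitter implementation; both function names are hypothetical.

```python
def index_split_sizes(n_samples, train_pct=80, valid_pct=10):
    """Return (train, valid, test) sizes for a contiguous, in-order split."""
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    n_train = n_samples * train_pct // 100
    n_valid = n_samples * valid_pct // 100
    n_test = n_samples - n_train - n_valid
    return n_train, n_valid, n_test


def index_split(items, train_pct=80, valid_pct=10):
    """Slice items into train/valid/test without shuffling."""
    n_train, n_valid, _ = index_split_sizes(len(items), train_pct, valid_pct)
    return (items[:n_train],
            items[n_train:n_train + n_valid],
            items[n_train + n_valid:])


sizes = index_split_sizes(19200)
train_ids, valid_ids, test_ids = index_split(list(range(20)))
print(sizes, len(train_ids), len(valid_ids), len(test_ids))
```

For 19,200 samples this gives 15,360 training samples, matching the shape shown above, and for 20 samples it gives 16, matching the BBBC004 example.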

BBBP Datasets

BBBP stands for Blood-Brain-Barrier Penetration

load_bbbp(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['balancing'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load BBBP dataset

The blood-brain barrier penetration (BBBP) dataset is designed for the modeling and prediction of barrier permeability. As a membrane separating circulating blood and brain extracellular fluid, the blood-brain barrier blocks most drugs, hormones and neurotransmitters. Penetration of the barrier is therefore a long-standing issue in the development of drugs targeting the central nervous system.

This dataset includes binary labels for over 2000 compounds on their permeability properties.

Scaffold splitting is recommended for this dataset.

The raw data CSV file contains the columns below:

  • “name” - Name of the compound

  • “smiles” - SMILES representation of the molecular structure

  • “p_np” - Binary labels for penetration/non-penetration

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in
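As an illustration of the raw file layout described above, the sketch below parses a CSV with the same three column names and counts the labels. The rows are invented placeholders rather than entries from the dataset, and reading a label of 1 as "penetrates" is an assumption.

```python
import csv
import io
from collections import Counter

# Hypothetical rows using the column names listed above; not taken from the dataset.
raw_csv = """name,smiles,p_np
compound_a,CCO,1
compound_b,c1ccccc1,1
compound_c,CC(=O)O,0
"""

reader = csv.DictReader(io.StringIO(raw_csv))
records = [
    {"name": row["name"], "smiles": row["smiles"], "label": int(row["p_np"])}
    for row in reader
]

# Count the two classes (assumption: 1 = penetrating, 0 = non-penetrating).
label_counts = Counter(record["label"] for record in records)
print(label_counts[1], label_counts[0])
```

A quick class count like this shows how imbalanced the labels are, which is the situation the default 'balancing' transformer in the signature above is meant to address.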


Cell Counting Datasets

load_cell_counting(splitter: Splitter | str | None = None, transformers: List[TransformerGenerator | str] = [], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load Cell Counting dataset.

Loads the cell counting dataset from

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

Chembl Datasets

load_chembl(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['normalization'], set: str = '5thresh', reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load the ChEMBL dataset.

This dataset is based on release 22.1 of the ChEMBL data. Two subsets of the data are available, depending on the “set” argument. “sparse” is a large dataset with 244,245 compounds. As the name suggests, the data is extremely sparse, with most compounds having activity data for only one target. “5thresh” is a much smaller set (23,871 compounds) that includes only compounds with activity data for at least five targets.

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • set (str) – the subset to load, either “sparse” or “5thresh”

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

Chembl25 Datasets

load_chembl25(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Loads the ChEMBL25 dataset, featurizes it, and does a split.

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

Clearance Datasets

load_clearance(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['log'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load clearance datasets.

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

Clintox Datasets

load_clintox(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['balancing'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load ClinTox dataset

The ClinTox dataset compares drugs approved by the FDA and drugs that have failed clinical trials for toxicity reasons. The dataset includes two classification tasks for 1491 drug compounds with known chemical structures:

  1. clinical trial toxicity (or absence of toxicity)

  2. FDA approval status.

The list of FDA-approved drugs is compiled from the SWEETLEAD database, and the list of drugs that failed clinical trials for toxicity reasons is compiled from the Aggregate Analysis of ClinicalTrials.gov (AACT) database.

Random splitting is recommended for this dataset.

The raw data CSV file contains the columns below:

  • “smiles” - SMILES representation of the molecular structure

  • “FDA_APPROVED” - FDA approval status

  • “CT_TOX” - Clinical trial results

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in


Delaney Datasets

load_delaney(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load Delaney dataset

The Delaney (ESOL) dataset is a regression dataset containing structures and water solubility data for 1128 compounds. The dataset is widely used to validate machine learning models on estimating solubility directly from molecular structures (as encoded in SMILES strings).

Scaffold splitting is recommended for this dataset.

The raw data csv file contains columns below:

  • “Compound ID” - Name of the compound

  • “smiles” - SMILES representation of the molecular structure

  • “measured log solubility in mols per litre” - Log-scale water solubility of the compound, used as label

Parameters:

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in
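The idea behind the recommended scaffold split can be sketched in plain Python. This is a minimal illustration, not DeepChem's implementation: it assumes each compound already has a scaffold key (the `scaffolds` list below is hypothetical; a real pipeline would derive scaffolds from the SMILES strings), and the function name `scaffold_split` is made up for this example. Whole scaffold groups are assigned to one subset, largest groups first, so structurally similar compounds never straddle the train/test boundary.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split sample indices so that no scaffold group spans two subsets.

    `scaffolds` holds one precomputed scaffold key per sample (a
    hypothetical input used only for this sketch).
    """
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    # Largest groups first, so rare scaffolds land in validation/test.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cutoff = frac_train * n
    valid_cutoff = (frac_train + frac_valid) * n
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Ten toy compounds spread over four made-up scaffold keys.
scaffolds = ["benzene"] * 6 + ["pyridine"] * 2 + ["indole", "furan"]
train, valid, test = scaffold_split(scaffolds)
```

Because groups are kept intact, the realized split fractions only approximate the requested ones when groups are large.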


Factors Datasets

load_factors(shard_size=2000, featurizer=None, split=None, reload=True)

Loads FACTOR dataset; does not do train/test split

The Factors dataset is an in-house dataset from Merck that was first introduced in the following paper: Ramsundar, Bharath, et al. “Is multitask deep learning practical for pharma?.” Journal of chemical information and modeling 57.8 (2017): 2068-2076.

It contains 1500 Merck in-house compounds that were measured for IC50 of inhibition on 12 serine proteases. Unlike most of the other datasets featured in MoleculeNet, the Factors collection does not have structures for the compounds tested since they were proprietary Merck compounds. However, the collection does feature pre-computed descriptors for these compounds.

Note that the original train/valid/test split from the source data was preserved here, so this function doesn’t allow for alternate modes of splitting. Similarly, since the source data came pre-featurized, it is not possible to apply alternative featurizations.

  • shard_size (int, optional) – Size of the DiskDataset shards to write on disk

  • featurizer (optional) – Ignored since featurization pre-computed

  • split (optional) – Ignored since split pre-computed

  • reload (bool, optional) – Whether to automatically re-load from disk

Freesolv Dataset

load_freesolv(featurizer: Featurizer | str = MATFeaturizer[], splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load Freesolv dataset

The FreeSolv dataset is a collection of experimental and calculated hydration free energies for small molecules in water. Here, we use a modified version of the dataset that contains each molecule's SMILES string and the corresponding experimental hydration free energy.

Random splitting is recommended for this dataset.

The raw data csv file contains columns below:

  • “mol” - SMILES representation of the molecular structure

  • “y” - Experimental hydration free energy

Parameters:

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in
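The behaviour described for `reload` (cache on the first call for a given featurizer and splitter, reload afterwards) can be illustrated with a small standard-library sketch. Everything here is hypothetical and only mirrors the described semantics: the function `load_with_cache`, the JSON cache format, and the file naming are assumptions for illustration, not how DeepChem stores datasets.

```python
import json
import tempfile
from pathlib import Path


def load_with_cache(featurizer_name, splitter_name, build, save_dir, reload=True):
    """Return (data, from_cache) for one featurizer/splitter combination."""
    cache_file = Path(save_dir) / f"{featurizer_name}-{splitter_name}.json"
    if reload and cache_file.exists():
        # A previous call with the same settings already wrote the cache.
        return json.loads(cache_file.read_text()), True
    data = build()  # the expensive featurize-and-split step
    if reload:
        cache_file.write_text(json.dumps(data))
    return data, False


def build():
    # Stand-in for featurizing and splitting a real dataset.
    return {"train": [0, 1, 2], "valid": [3], "test": [4]}


with tempfile.TemporaryDirectory() as save_dir:
    first, first_cached = load_with_cache("ECFP", "random", build, save_dir)
    second, second_cached = load_with_cache("ECFP", "random", build, save_dir)
    other, other_cached = load_with_cache("ECFP", "scaffold", build, save_dir)
```

The cache key includes both names, so changing either the featurizer or the splitter triggers a fresh build rather than reusing a stale split.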


HIV Datasets

load_hiv(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['balancing'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load HIV dataset

The HIV dataset was introduced by the Drug Therapeutics Program (DTP) AIDS Antiviral Screen, which tested the ability to inhibit HIV replication for over 40,000 compounds. Screening results were evaluated and placed into three categories: confirmed inactive (CI), confirmed active (CA), and confirmed moderately active (CM). We further combine the latter two labels, making it a classification task between inactive (CI) and active (CA and CM).

Scaffold splitting is recommended for this dataset.

The raw data csv file contains columns below:

  • “smiles”: SMILES representation of the molecular structure

  • “activity”: Three-class labels for screening results: CI/CM/CA

  • “HIV_active”: Binary labels for screening results: 1 (CA/CM) and 0 (CI)

Parameters:

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in
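The relationship between the three-class “activity” column and the binary “HIV_active” column can be written out in a few lines. This is only an illustration of the mapping stated above (CA and CM become 1, CI becomes 0); the helper name `to_binary_label` and the sample values are made up.

```python
def to_binary_label(activity):
    """Map a three-class screening result to the binary HIV_active label."""
    mapping = {"CI": 0, "CM": 1, "CA": 1}
    return mapping[activity]


# Made-up screening results for four compounds.
activities = ["CI", "CA", "CM", "CI"]
hiv_active = [to_binary_label(a) for a in activities]
```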


HOPV Datasets

HOPV stands for the Harvard Organic Photovoltaic Dataset.

load_hopv(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load HOPV datasets. Does not do train/test split

The HOPV datasets consist of the “Harvard Organic Photovoltaic Dataset”. This dataset includes 350 small molecules and polymers that were utilized as p-type materials in OPVs. Experimental properties include: HOMO [a.u.], LUMO [a.u.], Electrochemical gap [a.u.], Optical gap [a.u.], Power conversion efficiency [%], Open circuit potential [V], Short circuit current density [mA/cm^2], and fill factor [%]. Theoretical calculations in the original dataset have been removed (for now).

Lopez, Steven A., et al. “The Harvard organic photovoltaic dataset.” Scientific data 3.1 (2016): 1-7.

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

HPPB Datasets

load_hppb(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['log'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Loads the thermodynamic solubility datasets.

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

KAGGLE Datasets

load_kaggle(shard_size=2000, featurizer=None, split=None, reload=True)

Loads kaggle datasets. Generates if not stored already.

The Kaggle dataset is an in-house dataset from Merck that was first introduced in the following paper:

Ma, Junshui, et al. “Deep neural nets as a method for quantitative structure–activity relationships.” Journal of chemical information and modeling 55.2 (2015): 263-274.

It contains 100,000 unique Merck in-house compounds that were measured on 15 enzyme inhibition and ADME/TOX datasets. Unlike most of the other datasets featured in MoleculeNet, the Kaggle collection does not have structures for the compounds tested since they were proprietary Merck compounds. However, the collection does feature pre-computed descriptors for these compounds.

Note that the original train/valid/test split from the source data was preserved here, so this function doesn’t allow for alternate modes of splitting. Similarly, since the source data came pre-featurized, it is not possible to apply alternative featurizations.

  • shard_size (int, optional) – Size of the DiskDataset shards to write on disk

  • featurizer (optional) – Ignored since featurization pre-computed

  • split (optional) – Ignored since split pre-computed

  • reload (bool, optional) – Whether to automatically re-load from disk

Kinase Datasets

load_kinase(shard_size=2000, featurizer=None, split=None, reload=True)

Loads Kinase datasets, does not do train/test split

The Kinase dataset is an in-house dataset from Merck that was first introduced in the following paper: Ramsundar, Bharath, et al. “Is multitask deep learning practical for pharma?.” Journal of chemical information and modeling 57.8 (2017): 2068-2076.

It contains 2500 Merck in-house compounds that were measured for IC50 of inhibition on 99 protein kinases. Unlike most of the other datasets featured in MoleculeNet, the Kinase collection does not have structures for the compounds tested since they were proprietary Merck compounds. However, the collection does feature pre-computed descriptors for these compounds.

Note that the original train/valid/test split from the source data was preserved here, so this function doesn’t allow for alternate modes of splitting. Similarly, since the source data came pre-featurized, it is not possible to apply alternative featurizations.

  • shard_size (int, optional) – Size of the DiskDataset shards to write on disk

  • featurizer (optional) – Ignored since featurization pre-computed

  • split (optional) – Ignored since split pre-computed

  • reload (bool, optional) – Whether to automatically re-load from disk

Lipo Datasets

load_lipo(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load Lipophilicity dataset

Lipophilicity is an important feature of drug molecules that affects both membrane permeability and solubility. The lipophilicity dataset, curated from ChEMBL database, provides experimental results of octanol/water distribution coefficient (logD at pH 7.4) of 4200 compounds.

Scaffold splitting is recommended for this dataset.

The raw data csv file contains columns below:

  • “smiles” - SMILES representation of the molecular structure

  • “exp” - Measured octanol/water distribution coefficient (logD) of the compound, used as label

Parameters:

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in
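The default 'normalization' transformer rescales the regression labels before training. A minimal sketch of the idea follows, assuming the transformer standardizes labels to zero mean and unit variance using statistics from the training set; the function names are hypothetical and this is not DeepChem's implementation. The inverse step matters in practice, because model predictions have to be mapped back to logD units before they are reported.

```python
import statistics


def fit_normalization(train_labels):
    """Compute the mean and (population) standard deviation of the labels."""
    mean = statistics.fmean(train_labels)
    std = statistics.pstdev(train_labels)
    return mean, std


def transform(labels, mean, std):
    """Standardize labels to zero mean and unit variance."""
    return [(y - mean) / std for y in labels]


def untransform(scaled, mean, std):
    """Map standardized values back to the original label scale."""
    return [z * std + mean for z in scaled]


# Made-up logD-like values, for illustration only.
train_labels = [1.0, 2.0, 3.0, 4.0]
mean, std = fit_normalization(train_labels)
scaled = transform(train_labels, mean, std)
restored = untransform(scaled, mean, std)
```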


Materials Datasets

Materials datasets include inorganic crystal structures, chemical compositions, and target properties like formation energies and band gaps. Machine learning problems in materials science commonly include predicting the value of a continuous (regression) or categorical (classification) property of a material based on its chemical composition or crystal structure. “Inverse design” is also of great interest, in which ML methods generate crystal structures that have a desired property. Other areas where ML is applicable in materials include discovering new or modified phenomenological models that describe material behavior.

load_bandgap(featurizer: Featurizer | str = ElementPropertyFingerprint[data_source='matminer'], splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load band gap dataset.

Contains 4604 experimentally measured band gaps for inorganic crystal structure compositions. In benchmark studies, random forest models achieved a mean absolute error of 0.45 eV during five-fold nested cross validation on this dataset.

For more details on the dataset see [1]_. For more details on previous benchmarks for this dataset, see [2]_.
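The benchmark figure quoted for this dataset is a mean absolute error (MAE), which averages the absolute differences between predicted and measured values. The sketch below shows the computation on made-up numbers; the values are illustrative and are not taken from the band gap dataset.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up band gaps in eV, for illustration only.
measured = [1.0, 2.5, 0.0, 3.2]
predicted = [1.5, 2.0, 0.25, 3.2]
mae = mean_absolute_error(measured, predicted)
```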

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in


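Returns: tasks, datasets, transformers

  • tasks (list) – Column names corresponding to machine learning target variables.

  • datasets (tuple) – train, validation, test splits of data as deepchem.data.datasets.Dataset instances.

  • transformers (list) – deepchem.trans.transformers.Transformer instances applied to dataset.

Return type: Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]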

>>> import deepchem as dc
>>> tasks, datasets, transformers = dc.molnet.load_bandgap()
>>> train_dataset, val_dataset, test_dataset = datasets
>>> n_tasks = len(tasks)
>>> n_features = train_dataset.get_data_shape()[0]
>>> model = dc.models.MultitaskRegressor(n_tasks, n_features)

load_perovskite(featurizer: Featurizer | str = DummyFeaturizer[], splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load perovskite dataset.

Contains 18928 perovskite structures and their formation energies. In benchmark studies, random forest models and crystal graph neural networks achieved mean absolute errors of 0.23 and 0.05 eV/atom, respectively, during five-fold nested cross validation on this dataset.

For more details on the dataset see [1]_. For more details on previous benchmarks for this dataset, see [2]_.

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in


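Returns: tasks, datasets, transformers

  • tasks (list) – Column names corresponding to machine learning target variables.

  • datasets (tuple) – train, validation, test splits of data as deepchem.data.datasets.Dataset instances.

  • transformers (list) – deepchem.trans.transformers.Transformer instances applied to dataset.

Return type: Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]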

>>> import deepchem as dc
>>> tasks, datasets, transformers = dc.molnet.load_perovskite()
>>> train_dataset, val_dataset, test_dataset = datasets
>>> model = dc.models.CGCNNModel(mode='regression', batch_size=32, learning_rate=0.001)

load_mp_formation_energy(featurizer: Featurizer | str = SineCoulombMatrix[max_atoms=100, flatten=True], splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load mp formation energy dataset.

Contains 132752 calculated formation energies and inorganic crystal structures from the Materials Project database. In benchmark studies, random forest models achieved a mean absolute error of 0.116 eV/atom during five-fold nested cross validation on this dataset.

For more details on the dataset see [1]_. For more details on previous benchmarks for this dataset, see [2]_.
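Nested cross validation, as used in the benchmark quoted above, wraps an inner cross-validation loop (for model selection) inside each outer training fold (for the final error estimate). The sketch below only generates the index sets and is an assumption-laden illustration: the helper names are hypothetical, folds are built by simple striding without shuffling, and the benchmark's actual fold construction may differ.

```python
def k_fold_indices(indices, k):
    """Yield (train, held_out) index lists for each of k folds."""
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out


def nested_cv_indices(n_samples, outer_k=5, inner_k=5):
    """Yield (inner_splits, outer_train, outer_test) for nested CV."""
    all_indices = list(range(n_samples))
    for outer_train, outer_test in k_fold_indices(all_indices, outer_k):
        # Inner folds are drawn only from the outer training portion, so
        # the outer test fold never influences model selection.
        inner_splits = list(k_fold_indices(outer_train, inner_k))
        yield inner_splits, outer_train, outer_test


splits = list(nested_cv_indices(25))
```

Hyperparameters would be chosen using the inner splits, and the reported error would be averaged over the five outer test folds.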

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in


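Returns: tasks, datasets, transformers

  • tasks (list) – Column names corresponding to machine learning target variables.

  • datasets (tuple) – train, validation, test splits of data as deepchem.data.datasets.Dataset instances.

  • transformers (list) – deepchem.trans.transformers.Transformer instances applied to dataset.

Return type: Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]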

>>> import deepchem as dc
>>> tasks, datasets, transformers = dc.molnet.load_mp_formation_energy()
>>> train_dataset, val_dataset, test_dataset = datasets
>>> n_tasks = len(tasks)
>>> n_features = train_dataset.get_data_shape()[0]
>>> model = dc.models.MultitaskRegressor(n_tasks, n_features)

load_mp_metallicity(featurizer: Featurizer | str = SineCoulombMatrix[max_atoms=100, flatten=True], splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = ['balancing'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load mp metallicity dataset.

Contains 106113 inorganic crystal structures from the Materials Project database labeled as metals or nonmetals. In benchmark studies, random forest models achieved a mean ROC-AUC of 0.9 during five-fold nested cross validation on this dataset.

For more details on the dataset see [1]_. For more details on previous benchmarks for this dataset, see [2]_.
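ROC-AUC, the metric quoted for this classification benchmark, can be read as the probability that a randomly chosen positive example (here, a metal) receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition on made-up labels and scores; it is a quadratic-time illustration, not an optimized or library implementation.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a positive outranks a negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = metal, 0 = nonmetal) and classifier scores.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.8, 0.1]
auc = roc_auc(labels, scores)
```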

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in


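Returns: tasks, datasets, transformers

  • tasks (list) – Column names corresponding to machine learning target variables.

  • datasets (tuple) – train, validation, test splits of data as deepchem.data.datasets.Dataset instances.

  • transformers (list) – deepchem.trans.transformers.Transformer instances applied to dataset.

Return type: Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]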

>>> import deepchem as dc
>>> tasks, datasets, transformers = dc.molnet.load_mp_metallicity()
>>> train_dataset, val_dataset, test_dataset = datasets
>>> n_tasks = len(tasks)
>>> n_features = train_dataset.get_data_shape()[0]
>>> # Metallicity is a metal/nonmetal classification task, so use a classifier.
>>> model = dc.models.MultitaskClassifier(n_tasks, n_features)

MUV Datasets

load_muv(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['balancing'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load MUV dataset

The Maximum Unbiased Validation (MUV) group is a benchmark dataset selected from PubChem BioAssay by applying a refined nearest neighbor analysis.

The MUV dataset contains 17 challenging tasks for around 90 thousand compounds and is specifically designed for validation of virtual screening techniques.

Scaffold splitting is recommended for this dataset.

The raw data csv file contains columns below:

  • “mol_id” - PubChem CID of the compound

  • “smiles” - SMILES representation of the molecular structure

  • “MUV-XXX” - Measured results (Active/Inactive) for bioassays

Parameters:

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in
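Screening datasets like this one are typically dominated by inactive compounds, which is why the default transformer here is 'balancing'. The sketch below illustrates one way such reweighting can work, assuming the goal is for active and inactive examples to carry equal total weight within a task. The function name and weighting scheme are assumptions for illustration; DeepChem's transformer may compute its weights differently.

```python
def balancing_weights(labels):
    """Per-sample weights giving both classes the same total weight."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    # Negatives keep weight 1; positives are scaled up to match their total.
    pos_weight = n_neg / n_pos
    return [pos_weight if y == 1 else 1.0 for y in labels]


# Made-up task with eight inactive and two active compounds.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
weights = balancing_weights(labels)
```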


NCI Datasets

load_nci(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load NCI dataset.

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

PCBA Datasets

load_pcba(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['balancing'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load PCBA dataset

PubChem BioAssay (PCBA) is a database consisting of biological activities of small molecules generated by high-throughput screening. We use a subset of PCBA, containing 128 bioassays measured over 400 thousand compounds, used by previous work to benchmark machine learning methods.

Random splitting is recommended for this dataset.

The raw data csv file contains columns below:

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in


PDBBIND Datasets

load_pdbbind(featurizer: ComplexFeaturizer, splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, pocket: bool = True, set_name: str = 'core', **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load PDBBind dataset.

The PDBBind dataset includes experimental binding affinity data and structures for 4852 protein-ligand complexes from the “refined set” and 12800 complexes from the “general set” in PDBBind v2019 and 193 complexes from the “core set” in PDBBind v2013. The refined set removes data with obvious problems in 3D structure, binding data, or other aspects and should therefore be a better starting point for docking/scoring studies. Details on the criteria used to construct the refined set can be found in [4]_. The general set does not include the refined set. The core set is a subset of the refined set that is not updated annually.

Random splitting is recommended for this dataset.

The raw dataset contains the columns below:

  • “ligand” - SDF of the molecular structure

  • “protein” - PDB of the protein structure

  • “-logKd/Ki” - Experimental binding affinity, used as label

Parameters:

  • featurizer (ComplexFeaturizer or str) – the complex featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

  • pocket (bool (default True)) – If true, use only the binding pocket for featurization.

  • set_name (str (default 'core')) – Name of dataset to download. ‘refined’, ‘general’, and ‘core’ are supported.


Returns: tasks, datasets, transformers

  • tasks (list) – Column names corresponding to machine learning target variables.

  • datasets (tuple) – train, validation, test splits of data as deepchem.data.datasets.Dataset instances.

  • transformers (list) – deepchem.trans.transformers.Transformer instances applied to dataset.

Return type: Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

PPB Datasets

load_ppb(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load PPB datasets.

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

QM7 Datasets

load_qm7(featurizer: Featurizer | str = CoulombMatrix[max_atoms=23, remove_hydrogens=False, randomize=False, upper_tri=False, n_samples=1, seed=None], splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load QM7 dataset

QM7 is a subset of GDB-13 (a database of nearly 1 billion stable and synthetically accessible organic molecules) containing molecules with up to 7 heavy atoms (C, N, O, and S). The 3D Cartesian coordinates of the most stable conformations and their atomization energies were determined using ab-initio density functional theory (PBE0/tier2 basis set). This dataset also provides Coulomb matrices as calculated in [Rupp et al. PRL, 2012].
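As a rough illustration of the Coulomb matrix representation from Rupp et al., the sketch below builds one in plain Python: diagonal entries are 0.5 * Z_i^2.4 and off-diagonal entries are Z_i * Z_j / |R_i - R_j|, with zero padding up to a fixed size. The function name `coulomb_matrix` and its arguments are hypothetical; this is not the code of DeepChem's CoulombMatrix featurizer, only a minimal sketch assuming coordinates in Bohr.

```python
import math


def coulomb_matrix(charges, coords, max_atoms):
    """Return a zero-padded max_atoms x max_atoms Coulomb matrix.

    charges: nuclear charges Z_i, one per atom.
    coords: (x, y, z) positions, assumed to be in Bohr.
    """
    n_atoms = len(charges)
    if n_atoms > max_atoms:
        raise ValueError("molecule has more atoms than max_atoms")
    matrix = [[0.0] * max_atoms for _ in range(max_atoms)]
    for i in range(n_atoms):
        for j in range(n_atoms):
            if i == j:
                # Diagonal term: 0.5 * Z_i ** 2.4
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # Off-diagonal term: nuclear repulsion Z_i * Z_j / distance
                distance = math.dist(coords[i], coords[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


# Toy input (not taken from QM7): two hydrogen atoms 1.4 Bohr apart,
# padded to a 3 x 3 matrix.
h2 = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)], max_atoms=3)
```

Padding to a fixed size is what makes every sample the same shape, which is consistent with the 23 x 23 matrices stored in the data file.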

Stratified splitting is recommended for this dataset.

The data file (.mat format, we recommend using for python users to load this original data) contains five arrays:

  • “X” - (7165 x 23 x 23), Coulomb matrices

  • “T” - (7165), atomization energies (unit: kcal/mol)

  • “P” - (5 x 1433), cross-validation splits as used in [Montavon et al. NIPS, 2012]

  • “Z” - (7165 x 23), atomic charges

  • “R” - (7165 x 23 x 3), Cartesian coordinates (unit: Bohr) of each atom in the molecules
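The “P” array has shape 5 x 1433, and 5 * 1433 = 7165 matches the number of samples. A minimal sketch of how such an array could be turned into train/test index lists follows. It assumes each row of “P” holds the sample indices of one fold, which is an assumption about the file layout rather than something stated here; `fold_indices` and the toy data are hypothetical.

```python
def fold_indices(splits, test_fold):
    """Split sample indices into (train, test) for one cross-validation fold.

    splits: a sequence of folds, each a sequence of sample indices
            (assumed layout of the "P" array).
    test_fold: the row of `splits` held out for testing.
    """
    test = list(splits[test_fold])
    train = [
        index
        for fold_number, fold in enumerate(splits)
        if fold_number != test_fold
        for index in fold
    ]
    return train, test


# Toy stand-in for "P": 5 folds of 2 samples each instead of 5 x 1433.
toy_splits = [[0, 5], [1, 6], [2, 7], [3, 8], [4, 9]]
train_idx, test_idx = fold_indices(toy_splits, test_fold=0)
```

Looping `test_fold` over 0 to 4 would give the five train/test partitions of a standard five-fold protocol.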

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in


DeepChem 2.4.0 has turned on sanitization for this dataset by default. For the QM7 dataset, this means that calling this function will return 6838 compounds instead of 7160 in the source dataset file. This appears to be due to valence specification mismatches in the dataset that weren’t caught in earlier more lax versions of RDKit. Note that this may subtly affect benchmarking results on this dataset.


QM8 Datasets

load_qm8(featurizer: Featurizer | str = CoulombMatrix[max_atoms=26, remove_hydrogens=False, randomize=False, upper_tri=False, n_samples=1, seed=None], splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load QM8 dataset

QM8 is the dataset used in a study on modeling quantum mechanical calculations of electronic spectra and excited state energy of small molecules. Multiple methods, including time-dependent density functional theories (TDDFT) and second-order approximate coupled-cluster (CC2), are applied to a collection of molecules that include up to eight heavy atoms (also a subset of the GDB-17 database). In our collection, there are four excited state properties calculated by four different methods on 22 thousand samples:

S0 -> S1 transition energy E1 and the corresponding oscillator strength f1

S0 -> S2 transition energy E2 and the corresponding oscillator strength f2

E1, E2, f1, f2 are in atomic units. f1, f2 are in length representation

Random splitting is recommended for this dataset.

The source data contain:

  • qm8.sdf: molecular structures

  • qm8.sdf.csv: tables for molecular properties

    • Column 1: Molecule ID (gdb9 index) mapping to the .sdf file

    • Columns 2-5: RI-CC2/def2TZVP

    • Columns 6-9: LR-TDPBE0/def2SVP

    • Columns 10-13: LR-TDPBE0/def2TZVP

    • Columns 14-17: LR-TDCAM-B3LYP/def2TZVP
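The column layout above (one ID column followed by four property columns for each of four methods) can be captured in a small lookup. This is only a sketch of the layout as described in the list; `method_for_column` is a hypothetical helper, not part of DeepChem.

```python
# Methods in the order their column blocks appear in qm8.sdf.csv.
METHODS = [
    "RI-CC2/def2TZVP",
    "LR-TDPBE0/def2SVP",
    "LR-TDPBE0/def2TZVP",
    "LR-TDCAM-B3LYP/def2TZVP",
]
COLUMNS_PER_METHOD = 4       # E1, E2, f1, f2
FIRST_PROPERTY_COLUMN = 2    # column 1 is the molecule ID


def method_for_column(column):
    """Return the method that produced a 1-based property column number."""
    last_column = FIRST_PROPERTY_COLUMN + len(METHODS) * COLUMNS_PER_METHOD - 1
    if not FIRST_PROPERTY_COLUMN <= column <= last_column:
        raise ValueError(f"column {column} is not a property column")
    return METHODS[(column - FIRST_PROPERTY_COLUMN) // COLUMNS_PER_METHOD]
```

For example, `method_for_column(6)` falls in the second block and returns the LR-TDPBE0/def2SVP entry.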

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in


DeepChem 2.4.0 has turned on sanitization for this dataset by default. For the QM8 dataset, this means that calling this function will return 21747 compounds instead of 21786 in the source dataset file. This appears to be due to valence specification mismatches in the dataset that weren’t caught in earlier more lax versions of RDKit. Note that this may subtly affect benchmarking results on this dataset.


QM9 Datasets

load_qm9(featurizer: Featurizer | str = CoulombMatrix[max_atoms=29, remove_hydrogens=False, randomize=False, upper_tri=False, n_samples=1, seed=None], splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load QM9 dataset

QM9 is a comprehensive dataset that provides geometric, energetic, electronic and thermodynamic properties for a subset of GDB-17 database, comprising 134 thousand stable organic molecules with up to 9 heavy atoms. All molecules are modeled using density functional theory (B3LYP/6-31G(2df,p) based DFT).

Random splitting is recommended for this dataset.

The source data contain:

  • qm9.sdf: molecular structures

  • qm9.sdf.csv: tables for molecular properties

    • “mol_id” - Molecule ID (gdb9 index) mapping to the .sdf file

    • “A” - Rotational constant (unit: GHz)

    • “B” - Rotational constant (unit: GHz)

    • “C” - Rotational constant (unit: GHz)

    • “mu” - Dipole moment (unit: D)

    • “alpha” - Isotropic polarizability (unit: Bohr^3)

    • “homo” - Highest occupied molecular orbital energy (unit: Hartree)

    • “lumo” - Lowest unoccupied molecular orbital energy (unit: Hartree)

    • “gap” - Gap between HOMO and LUMO (unit: Hartree)

    • “r2” - Electronic spatial extent (unit: Bohr^2)

    • “zpve” - Zero point vibrational energy (unit: Hartree)

    • “u0” - Internal energy at 0K (unit: Hartree)

    • “u298” - Internal energy at 298.15K (unit: Hartree)

    • “h298” - Enthalpy at 298.15K (unit: Hartree)

    • “g298” - Free energy at 298.15K (unit: Hartree)

    • “cv” - Heat capacity at 298.15K (unit: cal/(mol*K))

    • “u0_atom” - Atomization energy at 0K (unit: kcal/mol)

    • “u298_atom” - Atomization energy at 298.15K (unit: kcal/mol)

    • “h298_atom” - Atomization enthalpy at 298.15K (unit: kcal/mol)

    • “g298_atom” - Atomization free energy at 298.15K (unit: kcal/mol)
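The "_atom" columns are given in kcal/mol while the total energies are in Hartree, so going from one to the other involves subtracting per-atom reference energies and converting units. The sketch below shows that arithmetic for the 0K case. The reference energies and the methane total energy are illustrative placeholders, not values quoted from this document; `atomization_energy` is a hypothetical helper.

```python
HARTREE_TO_KCAL_PER_MOL = 627.5095

# Illustrative per-atom reference energies at 0K, in Hartree.
# Placeholder values for this sketch; use the dataset's own reference table.
REFERENCE_U0 = {
    "H": -0.500273,
    "C": -37.846772,
    "N": -54.583861,
    "O": -75.064579,
    "F": -99.718730,
}


def atomization_energy(total_energy_hartree, atom_counts, reference=REFERENCE_U0):
    """Return (total energy - sum of atomic reference energies) in kcal/mol."""
    reference_sum = sum(
        reference[element] * count for element, count in atom_counts.items()
    )
    return (total_energy_hartree - reference_sum) * HARTREE_TO_KCAL_PER_MOL


# Methane (CH4) with an illustrative total energy of -40.47893 Hartree.
methane_u0_atom = atomization_energy(-40.47893, {"C": 1, "H": 4})
```

The result is negative because the bound molecule is lower in energy than its separated atoms.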

“u0_atom” ~ “g298_atom” (used in MoleculeNet) are calculated from the differences between “u0” ~ “g298” and sum of reference energies of all atoms in the molecules, as given in

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in


DeepChem 2.4.0 has turned on sanitization for this dataset by default. For the QM9 dataset, this means that calling this function will return 132480 compounds instead of 133885 in the source dataset file. This appears to be due to valence specification mismatches in the dataset that weren’t caught in earlier more lax versions of RDKit. Note that this may subtly affect benchmarking results on this dataset.
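To put the sanitization notes for QM7, QM8, and QM9 in perspective, the short sketch below computes how many compounds are dropped and what share of the source file that is, using only the counts quoted in those notes. The helper name `dropped` is hypothetical.

```python
# (compounds in source file, compounds returned by the loader),
# as quoted in the sanitization notes for each dataset.
COUNTS = {
    "QM7": (7160, 6838),
    "QM8": (21786, 21747),
    "QM9": (133885, 132480),
}


def dropped(source_count, loaded_count):
    """Return (number dropped, percentage of the source dropped)."""
    n_dropped = source_count - loaded_count
    return n_dropped, 100.0 * n_dropped / source_count


for name, (source_count, loaded_count) in COUNTS.items():
    n_dropped, percent = dropped(source_count, loaded_count)
    print(f"{name}: {n_dropped} compounds dropped ({percent:.2f}%)")
# QM7: 322 compounds dropped (4.50%)
# QM8: 39 compounds dropped (0.18%)
# QM9: 1405 compounds dropped (1.05%)
```

QM7 loses the largest share, so comparisons against results obtained before sanitization was enabled are most sensitive there.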


SAMPL Datasets

load_sampl(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load SAMPL(FreeSolv) dataset

The Free Solvation Database, FreeSolv(SAMPL), provides experimental and calculated hydration free energy of small molecules in water. The calculated values are derived from alchemical free energy calculations using molecular dynamics simulations. The experimental values are included in the benchmark collection.

Random splitting is recommended for this dataset.

The raw data csv file contains columns below:

  • “iupac” - IUPAC name of the compound

  • “smiles” - SMILES representation of the molecular structure

  • “expt” - Measured solvation energy (unit: kcal/mol) of the compound, used as label

  • “calc” - Calculated solvation energy (unit: kcal/mol) of the compound

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in


SIDER Datasets

load_sider(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['balancing'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load SIDER dataset

The Side Effect Resource (SIDER) is a database of marketed drugs and adverse drug reactions (ADR). The version of the SIDER dataset in DeepChem has grouped drug side effects into 27 system organ classes following MedDRA classifications measured for 1427 approved drugs.

Random splitting is recommended for this dataset.

The raw data csv file contains columns below:

  • “smiles”: SMILES representation of the molecular structure

  • “Hepatobiliary disorders” ~ “Injury, poisoning and procedural complications”: Recorded side effects for the drug. Please refer to for details on ADRs.

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in


Thermosol Datasets

load_thermosol(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = [], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Loads the thermodynamic solubility datasets.

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

Tox21 Datasets

load_tox21(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['balancing'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, tasks: List[str] = ['NR-AR', 'NR-AR-LBD', 'NR-AhR', 'NR-Aromatase', 'NR-ER', 'NR-ER-LBD', 'NR-PPAR-gamma', 'SR-ARE', 'SR-ATAD5', 'SR-HSE', 'SR-MMP', 'SR-p53'], **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load Tox21 dataset

The “Toxicology in the 21st Century” (Tox21) initiative created a public database measuring toxicity of compounds, which has been used in the 2014 Tox21 Data Challenge. This dataset contains qualitative toxicity measurements for 8k compounds on 12 different targets, including nuclear receptors and stress response pathways.

Random splitting is recommended for this dataset.

The raw data csv file contains columns below:

  • “smiles” - SMILES representation of the molecular structure

  • “NR-XXX” - Nuclear receptor signaling bioassays results

  • “SR-XXX” - Stress response bioassays results

please refer to for details.

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

  • tasks (List[str], (optional)) – Specify the set of tasks to load. If no task is specified, then it loads the default set of tasks, which are NR-AR, NR-AR-LBD, NR-AhR, NR-Aromatase, NR-ER, NR-ER-LBD, NR-PPAR-gamma, SR-ARE, SR-ATAD5, SR-HSE, SR-MMP, SR-p53.
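When passing a subset through the tasks argument, it can help to check the names first and to see which assay family each belongs to. The sketch below groups the twelve default task names from the signature by their “NR-” (nuclear receptor) or “SR-” (stress response) prefix. `group_by_assay_family` is a hypothetical helper and does not call DeepChem.

```python
# Default task names, copied from the load_tox21 signature.
TOX21_TASKS = [
    "NR-AR", "NR-AR-LBD", "NR-AhR", "NR-Aromatase", "NR-ER", "NR-ER-LBD",
    "NR-PPAR-gamma", "SR-ARE", "SR-ATAD5", "SR-HSE", "SR-MMP", "SR-p53",
]


def group_by_assay_family(tasks):
    """Group task names by prefix: NR (nuclear receptor) or SR (stress response)."""
    groups = {"NR": [], "SR": []}
    for task in tasks:
        if task not in TOX21_TASKS:
            raise ValueError(f"unknown Tox21 task: {task}")
        prefix = task.split("-", 1)[0]
        groups[prefix].append(task)
    return groups


families = group_by_assay_family(TOX21_TASKS)
stress_response_tasks = families["SR"]  # a candidate value for `tasks`
```

A list such as `stress_response_tasks` is the kind of value the tasks parameter expects when only some assays are of interest.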


Toxcast Datasets

load_toxcast(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['balancing'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load Toxcast dataset

ToxCast is an extended data collection from the same initiative as Tox21, providing toxicology data for a large library of compounds based on in vitro high-throughput screening. The processed collection includes qualitative results of over 600 experiments on 8k compounds.

Random splitting is recommended for this dataset.

The raw data csv file contains columns below:

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in


USPTO Datasets

load_uspto(featurizer: Featurizer | str = 'RxnFeaturizer', splitter: Splitter | str | None = None, transformers: List[TransformerGenerator | str] = [], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, subset: str = 'MIT', sep_reagent: bool = True, **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load USPTO Datasets.

The USPTO dataset consists of over 1.8 million organic chemical reactions extracted from US patents and patent applications. The dataset contains the reactions in the form of reaction SMILES, which have the general format reactant>reagent>product.

MolNet provides the ability to load subsets of the USPTO dataset, namely MIT, STEREO and 50K. The MIT dataset contains around 479K reactions, curated by Jin et al. The STEREO dataset contains around 1 million reactions; it has no duplicates, and its reactions include stereochemical information. The 50K dataset contains 50,000 reactions and is the benchmark for retrosynthesis predictions. The reactions are additionally classified into 10 reaction classes. The canonicalized version of the dataset used by the loader is the same as that used by Somnath et al.

The loader uses the SpecifiedSplitter to reproduce the splits specified by Schwaller et al. and Dai et al. Custom splitters can also be used. One toggle in the loader skips the source/target transformation needed for seq2seq tasks. Another toggle loads the dataset with the reagents and reactants either separated or mixed; mixing alters the entries in source by replacing the ‘>’ with ‘.’, effectively loading them as a unified SMILES string.
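The effect of separating or mixing reagents can be illustrated on a single reaction SMILES string. The sketch below splits a reactant>reagent>product string into a source and a target and, when mixing is requested, joins reactants and reagents with ‘.’. It is a plain-Python illustration of the behaviour described above, not the loader's code; `split_reaction` and the example reaction are hypothetical.

```python
def split_reaction(reaction_smiles, sep_reagent=True):
    """Split 'reactant>reagent>product' into (source, target).

    With sep_reagent=True the source keeps the '>' between reactants and
    reagents; with sep_reagent=False they are merged into one SMILES string.
    """
    parts = reaction_smiles.split(">")
    if len(parts) != 3:
        raise ValueError("expected the form reactant>reagent>product")
    reactants, reagents, product = parts
    if sep_reagent:
        source = f"{reactants}>{reagents}"
    else:
        # Skip an empty reagent field so no trailing '.' is produced.
        source = ".".join(part for part in (reactants, reagents) if part)
    return source, product


reaction = "CC(=O)O.OCC>[H+]>CC(=O)OCC"
separated = split_reaction(reaction, sep_reagent=True)
mixed = split_reaction(reaction, sep_reagent=False)
```

Here `separated` keeps `[H+]` after a ‘>’, while `mixed` presents acid, alcohol, and catalyst as one dot-separated string.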

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

  • subset (str (default 'MIT')) – Subset of dataset to download. ‘FULL’, ‘MIT’, ‘STEREO’, and ‘50K’ are supported.

  • sep_reagent (bool (default True)) – Toggle to load dataset with reactants and reagents either separated or mixed.

  • skip_transform (bool (default True)) – Toggle to skip the source/target transformation.


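tasks, datasets, transformers

  • tasks (list) – Column names corresponding to machine learning target variables.

  • datasets (tuple) – train, validation, test splits of data as Dataset instances.

  • transformers (list) – deepchem.trans.transformers.Transformer instances applied to dataset.

Return type: Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]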



UV Datasets

load_uv(shard_size=2000, featurizer=None, split=None, reload=True)

Load UV dataset; does not do train/test split

The UV dataset is an in-house dataset from Merck that was first introduced in the following paper: Ramsundar, Bharath, et al. “Is multitask deep learning practical for pharma?.” Journal of chemical information and modeling 57.8 (2017): 2068-2076.

The UV dataset tests 10,000 of Merck’s internal compounds on 190 absorption wavelengths between 210 and 400 nm. Unlike most of the other datasets featured in MoleculeNet, the UV collection does not have structures for the compounds tested since they were proprietary Merck compounds. However, the collection does feature pre-computed descriptors for these compounds.

Note that the original train/valid/test split from the source data was preserved here, so this function doesn’t allow for alternate modes of splitting. Similarly, since the source data came pre-featurized, it is not possible to apply alternative featurizations.

  • shard_size (int, optional) – Size of the DiskDataset shards to write on disk

  • featurizer (optional) – Ignored since featurization pre-computed

  • split (optional) – Ignored since split pre-computed

  • reload (bool, optional) – Whether to automatically re-load from disk

ZINC15 Datasets

load_zinc15(featurizer: Featurizer | str = 'OneHot', splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, dataset_size: str = '250K', dataset_dimension: str = '2D', tasks: List[str] = ['mwt', 'logp', 'reactive'], **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load zinc15.

ZINC15 is a dataset of over 230 million purchasable compounds for virtual screening of small molecules to identify structures that are likely to bind to drug targets. ZINC15 data is currently available in 2D (SMILES string) format.

MolNet provides subsets of 250K, 1M, and 10M “lead-like” compounds from ZINC15. The full dataset of 270M “goldilocks” compounds is also available. Compounds in ZINC15 are labeled by their molecular weight and LogP (solubility) values. Each compound also has information about how readily available (purchasable) it is and its reactivity. Lead-like compounds have molecular weight between 300 and 350 Daltons and LogP between -1 and 3.5. Goldilocks compounds are lead-like compounds with LogP values further restricted to between 2 and 3.
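The lead-like and goldilocks definitions above are simple range checks, which the following sketch makes explicit. Treating the range boundaries as inclusive is an assumption, and `classify_compound` is a hypothetical helper, not a DeepChem function.

```python
def classify_compound(mol_weight, logp):
    """Classify a compound by the ZINC15 ranges described above.

    mol_weight is in Daltons. Range boundaries are assumed to be inclusive.
    """
    lead_like = 300 <= mol_weight <= 350 and -1 <= logp <= 3.5
    if not lead_like:
        return "other"
    # Goldilocks compounds are lead-like with LogP further restricted to 2..3.
    if 2 <= logp <= 3:
        return "goldilocks"
    return "lead-like"


examples = [
    classify_compound(320.0, 2.5),   # inside both ranges
    classify_compound(320.0, 0.0),   # lead-like only
    classify_compound(400.0, 2.5),   # too heavy
]
```

Every goldilocks compound is also lead-like, which is why the goldilocks check only runs after the lead-like test passes.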

If reload = True and data_dir (save_dir) is specified, the loader will attempt to load the raw dataset (featurized dataset) from disk. Otherwise, the dataset will be downloaded from the DeepChem AWS bucket.

For more information on ZINC15, please see [1]_ and

  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

  • dataset_size (str (default '250K')) – Size of dataset to download. ‘250K’, ‘1M’, ‘10M’, and ‘270M’ are supported.

  • dataset_dimension (str (default '2D')) – Format of data to download. 2D SMILES strings or 3D SDF files.

  • tasks (List[str], (optional) default: [‘mwt’, ‘logp’, ‘reactive’]) – Specify the set of tasks to load. If no task is specified, then it loads the default set of tasks, which are mwt, logp, reactive.


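tasks, datasets, transformers

  • tasks (list) – Column names corresponding to machine learning target variables.

  • datasets (tuple) – train, validation, test splits of data as Dataset instances.

  • transformers (list) – deepchem.trans.transformers.Transformer instances applied to dataset.

Return type: Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]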



The total ZINC dataset with SMILES strings contains hundreds of millions of compounds and is over 100GB! ZINC250K is recommended for experimentation. The full set of 270M goldilocks compounds is 23GB.


Platinum Adsorption Dataset

load_Platinum_Adsorption(featurizer: Featurizer | str = SineCoulombMatrix[max_atoms=100, flatten=True], splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = [], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load Platinum Adsorption Dataset

The dataset consists of different configurations of adsorbates (i.e. N and NO) on a platinum surface, represented as a lattice, together with their formation energies. There are 648 different adsorbate configurations in this dataset, represented as Pymatgen Structure objects.

  1. Pymatgen Structure object with site_properties containing the following keys:
    • “SiteTypes”, indicating whether a site is an active site “A1” or a spectator site “S1”.

    • “oss”, the different occupational sites. For spectator sites, set it to -1.

  • featurizer (Featurizer (default LCNNFeaturizer)) – the featurizer to use for processing the data. The LCNNFeaturizer is recommended.

  • splitter (Splitter (default RandomSplitter)) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data, as appropriate for the featurizer. The LCNNFeaturizer does not require any transformation.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str, optional (default None)) – a directory to save the dataset in



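>>> import deepchem as dc
>>> # data_path (a directory) and feat_args (featurizer keyword arguments)
>>> # must be defined by the caller before this call.
>>> tasks, datasets, transformers = dc.molnet.load_Platinum_Adsorption(
...     reload=True,
...     data_dir=data_path,
...     save_dir=data_path,
...     featurizer_kwargs=feat_args)
>>> train_dataset, val_dataset, test_dataset = datasets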