MoleculeNet

The DeepChem library is packaged alongside the MoleculeNet suite of datasets. One of the most important parts of machine learning applications is finding a suitable dataset. The MoleculeNet suite has curated a whole range of datasets and loaded them into DeepChem dc.data.Dataset objects for convenience.

MoleculeNet Cheatsheet

When training a model or running a benchmark, you need a suitable dataset, and at the beginning the search for one can be time-consuming and confusing. The following cheatsheet is meant to help DeepChem users identify which dataset fits their purpose.

Each row represents a dataset and gives a brief description of it. The columns give the type of data (molecules, images, or materials) and the number of data points. Each dataset is referenced with a link to its paper. Some entries still need further information.

MoleculeNet description
Name	Description	Type	Data Points	Reference
BACE (Regression)	Provides binding results for a set of inhibitors of human beta-secretase (BACE-1)	Molecules	1513	ref
BACE (Classification)	Provides binding results for a set of inhibitors of human beta-secretase (BACE-1)	Molecules	1513	ref
BBBC (BBBC001)	Images of HT29 colon cancer cells	Images	6	ref
BBBC (BBBC002)	Images of Drosophila Kc167 cells	Images	50	ref
BBBC (BBBC003)	DIC Images of Mouse Embryos	Images	15	ref
BBBC (BBBC004)	Synthetic Images of clustered nuclei	Images	20	ref
BBBC (BBBC005)	Synthetic Images of clustered nuclei	Images	19200	ref
BBBP	Blood-Brain Barrier Penetration designed for the modeling and prediction of barrier permeability	Binary labels on permeability properties	2000	ref
Cell Counting	Synthetic emulations of fluorescence microscopic images of bacterial cells	Images	200	ref
ChEMBL (set = ‘sparse’)	A sparse subset of ChEMBL with activity data for one target	Molecules	244 245	ref
ChEMBL (set = ‘5thresh’)	A subset of ChEMBL with activity data for at least five targets	Molecules	23 871	ref
ChEMBL25		Molecules		ref
Clearance				ref
Clintox	Compares drugs approved by the FDA and drugs that have failed clinical trials for toxicity reasons.	Molecules	1491	ref
Delaney	A regression dataset containing structures and water solubility data	Molecules	1128	ref
Factors	Merck in-house compounds that were measured for IC50 of inhibition on 12 serine proteases	Molecules	1500
Freesolv	A collection of experimental and calculated hydration free energies for small molecules in water	Molecules	643	ref
HIV	A dataset of compounds tested for their ability to inhibit HIV replication	Molecules	40 000	ref
HOPV	Harvard Organic Photovoltaic dataset utilized as p-type materials	Molecules	350
HPPB	Thermodynamic solubility datasets
KAGGLE	In-house compounds that were measured on 15 enzyme inhibition and ADME/TOX datasets.	Molecules	100 000	ref
KINASE	In-house compounds that were measured for IC50 of inhibition on 99 protein kinases	Molecules	2 500
LIPO	Experimental results of octanol/water distribution coefficient (logD at pH 7.4)	Molecules	4 200	ref
Band Gap	Experimentally measured band gaps for inorganic crystal structure	Materials	4 604	ref
Perovskite	Contains Perovskite structures and their formation energies	Materials	18 928	ref
MP Formation Energy	Contains calculated formation energies and inorganic crystal structures from the Materials Project database	Materials	132 752	ref
MP Metallicity	Contains inorganic crystal structures from the Materials Project database labeled as metals or nonmetals	Materials	106 113	ref
MUV	Benchmark dataset selected from PubChem BioAssay by applying a refined nearest neighbor analysis	Molecules	90 000	ref
NCI
PCBA	Database consisting of biological activities of small molecules generated by high-throughput screening	Molecules	400 000	ref
PDBBIND	Experimental binding affinity data and structures of protein-ligand complexes	Molecules	“refined set” 4 852 - “general set” 12 800 - “core set” 193	ref
PPB
QM7	Subset of GDB-13 containing up to 7 heavy atoms CNOS	Molecules	7 165	ref
QM8	Dataset used in a study on modeling quantum mechanical calculations of electronic spectra and excited state energy of small molecules	Molecules	20 000	ref
QM9	Dataset that provides geometric/energetic/electronic and thermodynamic properties for a subset of GDB-17 database	Molecules	134 000	ref
SAMPL	Similar to the FreeSolv dataset; provides experimental and calculated hydration free energy of small molecules in water
SIDER	The Side Effect Resource (SIDER) is a database of marketed drugs and adverse drug reactions (ADR)	Molecules	1 427	ref
Thermosol	Thermodynamic solubility datasets
Tox21	The “Toxicology in the 21st Century” (Tox21) initiative created a public database measuring the toxicity of compounds	Molecules	8 000	ref
Toxcast	Toxicology data for an extensive library of compounds based on in vitro high-throughput screening	Molecules	8 000	ref
USPTO	Subsets of USPTO dataset of organic chemical reactions extracted from US patents and patent applications	Chemical reactions SMILES	MIT 479 000 - STEREO 1 M - 50K 50 000	ref
UV	The UV dataset tests Merck’s internal compounds on 190 absorption wavelengths between 210 and 400 nm	Molecules	10 000
ZINC15	Purchasable compounds for virtual screening of small molecules to identify structures that are likely to bind to drug targets	Molecules	250K - 1M - 10M	ref
Platinum Adsorption	Different configurations of adsorbates (i.e., N and NO) on a platinum surface, represented as a lattice, and their formation energy	Adsorbate Configurations	648

Contributing a new dataset to MoleculeNet

If you are proposing a new dataset to be included in the MoleculeNet benchmarking suite, please follow the instructions below. Please review the datasets already available in MolNet before contributing.

Read the Contribution guidelines.
Open an issue to discuss the dataset you want to add to MolNet.
Write a DatasetLoader class that inherits from deepchem.molnet.load_function.molnet_loader._MolnetLoader and implements a create_dataset method. See the _QM9Loader for a simple example.
Write a load_dataset function that documents the dataset and add your load function to deepchem.molnet.__init__.py for easy importing.
Prepare your dataset as a .tar.gz or .zip file. Accepted filetypes include CSV, JSON, and SDF.
Ask a member of the technical steering committee to add your .tar.gz or .zip file to the DeepChem AWS bucket. Modify your load function to pull down the dataset from AWS.
Add documentation for your loader to the MoleculeNet docs.
Submit a [WIP] PR (Work in progress pull request) following the PR template.
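
The packaging step can be done with the Python standard library. The following is a minimal sketch, assuming a dataset that consists of a single CSV file. The file name my_dataset.csv, its columns, and its two rows are hypothetical placeholders, not a real MoleculeNet dataset.

```python
import csv
import os
import tarfile
import tempfile

# Hypothetical placeholder rows: a SMILES column and one label column.
rows = [
    {"smiles": "CCO", "label": "0"},
    {"smiles": "c1ccccc1", "label": "1"},
]

work_dir = tempfile.mkdtemp()
csv_path = os.path.join(work_dir, "my_dataset.csv")
archive_path = os.path.join(work_dir, "my_dataset.tar.gz")

# Write the raw data as CSV, one of the accepted file types.
with open(csv_path, "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=["smiles", "label"])
    writer.writeheader()
    writer.writerows(rows)

# Pack the CSV into a gzip-compressed tar archive.
with tarfile.open(archive_path, "w:gz") as archive:
    archive.add(csv_path, arcname="my_dataset.csv")

# Re-open the archive to confirm which members it contains.
with tarfile.open(archive_path, "r:gz") as archive:
    members = archive.getnames()
print(members)
```

Passing arcname keeps the temporary directory path out of the archive, so the file unpacks under its plain name.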

Example Usage

Below is an example of how to load a MoleculeNet dataset and featurizer. This approach will work for any dataset in MoleculeNet by changing the load function and featurizer. For more details on the featurizers, see the Featurizers section.

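import deepchem as dc
from deepchem.feat.molecule_featurizers import MolGraphConvFeaturizer

# Featurize each molecule as a graph, including edge (bond) features
featurizer = MolGraphConvFeaturizer(use_edges=True)
# The loader returns the task names, the split datasets, and the transformers
dataset_dc = dc.molnet.load_qm9(featurizer=featurizer)
tasks, dataset, transformers = dataset_dc
train, valid, test = dataset

# Features, labels, weights, and sample identifiers of the training set
x, y, w, ids = train.X, train.y, train.w, train.ids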

Note that the “w” matrix represents the weight of each sample. Some assays may have missing values, in which case the weight is 0. Otherwise, the weight is 1.
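
To make the role of the weights concrete, here is a small sketch of a weighted mean squared error in which samples with weight 0 do not contribute. The label, prediction, and weight values are made up for illustration, and the helper weighted_mse is hypothetical rather than part of DeepChem.

```python
def weighted_mse(y_true, y_pred, w):
    """Mean squared error that ignores samples whose weight is 0."""
    total_weight = sum(w)
    if total_weight == 0:
        raise ValueError("all weights are zero; nothing to average")
    squared = [wi * (yt - yp) ** 2 for yt, yp, wi in zip(y_true, y_pred, w)]
    return sum(squared) / total_weight


# Made-up values: the third label is missing, so it is stored as a
# placeholder (0.0) and masked out with a weight of 0.
y_true = [1.0, 2.0, 0.0]
y_pred = [1.5, 2.0, 9.0]
w = [1.0, 1.0, 0.0]

print(weighted_mse(y_true, y_pred, w))  # prints 0.125
```

The large error on the third sample has no effect on the result, because its weight removes it from both the sum and the normalization.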

Additionally, the environment variable DEEPCHEM_DATA_DIR can be set, for example with os.environ['DEEPCHEM_DATA_DIR'] = 'path/to/store/featurized/dataset'. When it is set, the MoleculeNet loader stores the featurized dataset in that directory. The next time the dataset is loaded, it is read from the data directory directly rather than being featurized from the raw data again.
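
Setting the variable from Python might look like the sketch below. It uses a temporary directory purely as a stand-in; the name cache_dir and the directory prefix are illustrative choices, and in practice a persistent location would be more useful.

```python
import os
import tempfile

# A temporary directory stands in for a real cache location here.
cache_dir = tempfile.mkdtemp(prefix="deepchem_data_")

# Set the variable before calling any MoleculeNet loader.
os.environ["DEEPCHEM_DATA_DIR"] = cache_dir

# Check that the directory the variable points to exists.
print(os.path.isdir(os.environ["DEEPCHEM_DATA_DIR"]))
```

A loader called after this point, such as dc.molnet.load_qm9 in the example above, would then store its featurized output under that directory.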

BACE Dataset

load_bace_classification(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['balancing'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load BACE dataset with classification labels.

Loads the BACE dataset with the “class” column as labels. The dataset contains 1513 compounds and is a binary classification dataset with labels 0 or 1.

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

load_bace_regression(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load BACE dataset, regression labels

The BACE dataset provides quantitative IC50 and qualitative (binary label) binding results for a set of inhibitors of human beta-secretase 1 (BACE-1).

All data are experimental values reported in scientific literature over the past decade, some with detailed crystal structures available. A collection of 1522 compounds is provided, along with the regression labels of IC50. The number of tasks in the dataset is one.

Scaffold splitting is recommended for this dataset.

The raw data csv file contains columns below:

“mol” - SMILES representation of the molecular structure
“pIC50” - Negative log of the IC50 binding affinity
“class” - Binary labels for inhibitor
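
As a worked example of the “pIC50” column, the sketch below converts an IC50 into a pIC50. It assumes the usual convention that the IC50 is expressed in mol/L before taking the base-10 logarithm. The helper name ic50_nm_to_pic50 and the 100 nM input are illustrative and not taken from the dataset.

```python
import math


def ic50_nm_to_pic50(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    ic50_molar = ic50_nm * 1e-9
    return -math.log10(ic50_molar)


# A 100 nM inhibitor has a pIC50 of about 7.
print(round(ic50_nm_to_pic50(100.0), 3))
```

Because of the negative sign, a lower IC50 (a more potent inhibitor) gives a higher pIC50.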

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

BBBC Datasets

load_bbbc001(splitter: Splitter | str | None = 'index', transformers: List[TransformerGenerator | str] = [], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load BBBC001 dataset

This dataset contains 6 images of human HT29 colon cancer cells. The task is to learn to predict the cell counts in these images. The dataset is too small for training algorithms, but might serve as a good test dataset. https://data.broadinstitute.org/bbbc/BBBC001/

Parameters:

splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

load_bbbc002(splitter: Splitter | str | None = 'index', transformers: List[TransformerGenerator | str] = [], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load BBBC002 dataset

This dataset contains data corresponding to 5 samples of Drosophila Kc167 cells. There are 10 fields of view for each sample, each an image of size 512x512. Ground truth labels contain cell counts for this dataset. Full details about this dataset are present at https://data.broadinstitute.org/bbbc/BBBC002/.

Parameters:

splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

load_bbbc003(load_segmentation_mask: bool = False, splitter: Splitter | str | None = 'index', transformers: List[TransformerGenerator | str] = [], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load BBBC003 dataset

This dataset contains 15 differential interference contrast (DIC) images of mouse embryos. Each image is of size 640x480. Ground truth labels contain cell counts and segmentation masks for this dataset. Full details about this dataset are present at https://data.broadinstitute.org/bbbc/BBBC003/.

Parameters:

load_segmentation_mask (bool) – if True, the dataset will contain segmentation masks as labels. Otherwise, the dataset will contain cell counts as labels.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

Examples

Importing necessary modules

>>> import deepchem as dc
>>> import numpy as np

We can load the BBBC003 dataset with 2 types of labels: segmentation masks and cell counts. We will first load the dataset with cell counts as labels.

>>> loader = dc.molnet.load_bbbc003(load_segmentation_mask=False)
>>> tasks, dataset, transformers = loader
>>> train, val, test = dataset

The full dataset has 15 images of size 640x480, and the labels are cell counts. The training split holds 12 of them, which we can verify as follows:

>>> train.X.shape
(12,)
>>> train.y.shape
(12,)

We will now load the dataset with segmentation masks as labels.

>>> loader = dc.molnet.load_bbbc003(load_segmentation_mask=True)
>>> tasks, dataset, transformers = loader
>>> train, val, test = dataset

The full dataset again has 15 images of size 640x480, now with segmentation masks as labels. We can verify the size of the training split as follows:

>>> print(train.X.shape)
(12,)
>>> print(train.y.shape)
(12,)

Note: The image labelled ‘7_19_M2E15.tif’ is transposed to 480x640 in the source file, along with its segmentation mask. To match it with the other images, we need to transpose it back to 640x480.

This image is found at index 6 in the train dataset (assuming no shuffling has taken place).

First, we load the dataset as usual and split it into X, y, w and ids. Here, X is the list of input images, y is the list of labels, w is the list of weights and ids is the list of IDs for each sample.

>>> train_x, train_y, train_w, train_ids = train.X, train.y, train.w, train.ids

We can now transpose the image at index 6 in the input data (train_x):

>>> train_x[6] = train_x[6].T

We can now verify that the image is of size 640x480:

>>> print(train_x[6].shape)
(640, 480)

The segmentation mask with the same filename and index is transposed in the same way. In that case, we transpose the label (train_y) instead of the input data:

>>> train_y[6] = train_y[6].T

We can now verify that the mask is of size 640x480:

>>> train_y[6].shape
(640, 480)

load_bbbc004(overlap_probability: float = 0.0, load_segmentation_mask: bool = False, splitter: Splitter | str | None = 'index', transformers: List[TransformerGenerator | str] = [], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load BBBC004 dataset

This dataset contains data corresponding to 20 samples of synthetically generated fluorescent cell population images. There are 300 cells in each sample, each an image of size 950x950. Ground truth labels contain cell counts and segmentation masks for this dataset. Full details about this dataset are present at https://data.broadinstitute.org/bbbc/BBBC004/.

Parameters:

overlap_probability (float from list {0.0, 0.15, 0.3, 0.45, 0.6}) – the overlap probability of the synthetic cells in the images
load_segmentation_mask (bool) – if True, the dataset will contain segmentation masks as labels. Otherwise, the dataset will contain cell counts as labels.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

Examples

Importing necessary modules

>>> import deepchem as dc
>>> import numpy as np

We can load the BBBC004 dataset with 2 types of labels: segmentation masks and cell counts. We will first load the dataset with cell counts as labels.

>>> loader = dc.molnet.load_bbbc004(overlap_probability=0.0, load_segmentation_mask=False)
>>> tasks, dataset, transformers = loader
>>> train, val, test = dataset

We now have a dataset with 20 samples, each with 300 cells. The images are of size 950x950. The labels are cell counts. We can verify this as follows:

>>> train.X.shape
(16, 950, 950)
>>> train.y.shape
(16,)

We will now load the dataset with segmentation masks as labels.

>>> loader = dc.molnet.load_bbbc004(overlap_probability=0.0, load_segmentation_mask=True)
>>> tasks, dataset, transformers = loader
>>> train, val, test = dataset

We now have a dataset with 20 samples, each with 300 cells. The images are of size 950x950. The labels are segmentation masks. We can verify this as follows:

>>> train.X.shape
(16, 950, 950)
>>> train.y.shape
(16, 950, 950, 3)

load_bbbc005(splitter: Splitter | str | None = 'index', transformers: List[TransformerGenerator | str] = [], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load BBBC005 dataset

This dataset contains data corresponding to 19,200 samples of synthetically generated fluorescent cell population images. These images were simulated for a given cell count with a clustering probability of 25% and a CCD noise variance of 0.0001. Focus blur was simulated by applying varying Gaussian filters to the images. Each image is of size 520x696. Ground truth labels contain cell counts for this dataset. Full details about this dataset are present at https://data.broadinstitute.org/bbbc/BBBC005/.

Parameters:

splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

Examples

Importing necessary modules

>>> import deepchem as dc
>>> import numpy as np

We will now load the BBBC005 dataset with cell counts as labels.

>>> loader = dc.molnet.load_bbbc005()
>>> tasks, dataset, transformers = loader
>>> train, val, test = dataset

We now have a dataset with a total of 19,200 samples with cell counts in the range of 1-100. The images are of size 520x696. The labels are cell counts. We have a train-val-test split of 80:10:10. We can verify this as follows:

>>> train.X.shape
(15360, 520, 696)
>>> train.y.shape
(15360,)
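
The split sizes quoted in the BBBC examples follow from simple arithmetic. The sketch below assumes an 80:10:10 split that assigns samples in order. The helper index_split_sizes is hypothetical, written only for illustration, and is not DeepChem's splitter implementation.

```python
def index_split_sizes(n_samples, train_pct=80, valid_pct=10):
    """Return (train, valid, test) sizes for an in-order percentage split."""
    if not 0 <= train_pct + valid_pct <= 100:
        raise ValueError("percentages must sum to at most 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    train_end = n_samples * train_pct // 100
    valid_end = n_samples * (train_pct + valid_pct) // 100
    return train_end, valid_end - train_end, n_samples - valid_end


# 19,200 BBBC005 images give 15,360 training samples.
print(index_split_sizes(19200))  # prints (15360, 1920, 1920)

# 15 BBBC003 and 20 BBBC004 samples give 12 and 16 training samples.
print(index_split_sizes(15)[0], index_split_sizes(20)[0])  # prints 12 16
```

These training sizes agree with the first dimension of the train.X shapes shown in the BBBC003, BBBC004, and BBBC005 examples.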

BBBP Datasets

BBBP stands for blood-brain barrier penetration.

load_bbbp(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['balancing'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load BBBP dataset

The blood-brain barrier penetration (BBBP) dataset is designed for the modeling and prediction of barrier permeability. As a membrane separating circulating blood and brain extracellular fluid, the blood-brain barrier blocks most drugs, hormones and neurotransmitters. Penetration of the barrier is therefore a long-standing issue in the development of drugs targeting the central nervous system.

This dataset includes binary labels for over 2000 compounds on their permeability properties.

Scaffold splitting is recommended for this dataset.

The raw data csv file contains columns below:

“name” - Name of the compound
“smiles” - SMILES representation of the molecular structure
“p_np” - Binary labels for penetration/non-penetration

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

Cell Counting Datasets

load_cell_counting(splitter: Splitter | str | None = None, transformers: List[TransformerGenerator | str] = [], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load Cell Counting dataset.

Loads the cell counting dataset from http://www.robots.ox.ac.uk/~vgg/research/counting/index_org.html.

Parameters:

splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

Chembl Datasets

load_chembl(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['normalization'], set: str = '5thresh', reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load the ChEMBL dataset.

This dataset is based on release 22.1 of the data from https://www.ebi.ac.uk/chembl/. Two subsets of the data are available, depending on the “set” argument. “sparse” is a large dataset with 244,245 compounds. As the name suggests, the data is extremely sparse, with most compounds having activity data for only one target. “5thresh” is a much smaller set (23,871 compounds) that includes only compounds with activity data for at least five targets.
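
The difference between the two subsets can be illustrated with a toy activity table. This is only a sketch of the stated selection criterion (activity data for at least five targets). The compound names, the number of targets, and the values are invented, and count_measured is a hypothetical helper, not the procedure used to build the actual files.

```python
# Toy activity table: one row per compound, one entry per target.
# None marks a missing measurement; all values are invented.
activities = {
    "compound_a": [6.1, None, None, None, None, None],
    "compound_b": [5.0, 7.2, 6.3, None, 4.9, 8.0],
    "compound_c": [None, 5.5, 5.8, 6.0, 6.2, 7.1],
}


def count_measured(row):
    """Number of targets with a recorded activity value."""
    return sum(value is not None for value in row)


# Keep only compounds measured against at least five targets.
min_targets = 5
dense_subset = sorted(
    name for name, row in activities.items()
    if count_measured(row) >= min_targets
)
print(dense_subset)  # prints ['compound_b', 'compound_c']
```

In this toy table, compound_a resembles a typical “sparse” entry with a single measured target, so it would be excluded from a “5thresh”-style subset.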

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
set (str) – the subset to load, either “sparse” or “5thresh”
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

Chembl25 Datasets

load_chembl25(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Loads the ChEMBL25 dataset, featurizes it, and does a split.

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

Clearance Datasets

load_clearance(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['log'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load clearance datasets.

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

Clintox Datasets

load_clintox(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['balancing'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load ClinTox dataset

The ClinTox dataset compares drugs approved by the FDA and drugs that have failed clinical trials for toxicity reasons. The dataset includes two classification tasks for 1491 drug compounds with known chemical structures:

clinical trial toxicity (or absence of toxicity)
FDA approval status.

The list of FDA-approved drugs is compiled from the SWEETLEAD database, and the list of drugs that failed clinical trials for toxicity reasons is compiled from the Aggregate Analysis of ClinicalTrials.gov (AACT) database.

Random splitting is recommended for this dataset.

The raw data csv file contains columns below:

“smiles” - SMILES representation of the molecular structure
“FDA_APPROVED” - FDA approval status
“CT_TOX” - Clinical trial results

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
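
The default 'balancing' transformer reweights examples so that each class contributes the same total weight. The sketch below illustrates that idea in plain Python for a single binary task. It is not DeepChem's implementation; the helper name balanced_weights and the toy labels are made up for illustration.

```python
from collections import Counter


def balanced_weights(labels):
    """Return one weight per example so every class has the same total weight."""
    counts = Counter(labels)
    largest = max(counts.values())
    # Rarer classes get proportionally larger per-example weights.
    return [largest / counts[label] for label in labels]


# Toy labels for one binary task (made-up values): 1 = positive, 0 = negative.
labels = [0, 0, 0, 0, 0, 0, 1, 1]
weights = balanced_weights(labels)

total_neg = sum(w for w, y in zip(weights, labels) if y == 0)
total_pos = sum(w for w, y in zip(weights, labels) if y == 1)
print(total_neg, total_pos)
```

With six negatives and two positives, each positive example is weighted three times as heavily as a negative one, so both classes end up with the same total weight.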

Delaney Datasets¶

load_delaney(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]¶

Load Delaney dataset

The Delaney (ESOL) dataset is a regression dataset containing structures and water solubility data for 1128 compounds. The dataset is widely used to validate machine learning models that estimate solubility directly from molecular structures (as encoded in SMILES strings).

Scaffold splitting is recommended for this dataset.

The raw data CSV file contains the columns below:

“Compound ID” - Name of the compound
“smiles” - SMILES representation of the molecular structure
“measured log solubility in mols per litre” - Log-scale water solubility of the compound, used as label

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
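
The default 'normalization' transformer rescales the regression labels. A minimal sketch of that kind of scaling follows, assuming a z-score transform (subtract the mean, divide by the standard deviation) fitted on the training labels. The helper names and the label values are hypothetical, and this is not DeepChem code.

```python
import math
import statistics


def fit_normalizer(labels):
    """Return the mean and population standard deviation of the labels."""
    return statistics.fmean(labels), statistics.pstdev(labels)


def normalize(labels, mean, std):
    """Map labels to zero mean and unit standard deviation."""
    return [(value - mean) / std for value in labels]


def denormalize(values, mean, std):
    """Map normalized values back to the original label scale."""
    return [value * std + mean for value in values]


# Made-up log-solubility labels, for illustration only.
train_labels = [-1.0, -2.0, -3.0, -4.0]
mean, std = fit_normalizer(train_labels)
scaled = normalize(train_labels, mean, std)
restored = denormalize(scaled, mean, std)
print(scaled)
```

A model trained on the scaled labels predicts in scaled units, so the inverse mapping is what brings predictions back to log-solubility values.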

Factors Datasets¶

load_factors(shard_size=2000, featurizer=None, split=None, reload=True)[source]¶

Loads FACTOR dataset; does not do train/test split

The Factors dataset is an in-house dataset from Merck that was first introduced in the following paper: Ramsundar, Bharath, et al. “Is multitask deep learning practical for pharma?.” Journal of chemical information and modeling 57.8 (2017): 2068-2076.

It contains 1500 Merck in-house compounds that were measured for IC50 of inhibition on 12 serine proteases. Unlike most of the other datasets featured in MoleculeNet, the Factors collection does not have structures for the compounds tested since they were proprietary Merck compounds. However, the collection does feature pre-computed descriptors for these compounds.

Note that the original train/valid/test split from the source data was preserved here, so this function doesn’t allow for alternate modes of splitting. Similarly, since the source data came pre-featurized, it is not possible to apply alternative featurizations.

Parameters:

shard_size (int, optional) – Size of the DiskDataset shards to write on disk
featurizer (optional) – Ignored since featurization pre-computed
split (optional) – Ignored since split pre-computed
reload (bool, optional) – Whether to automatically re-load from disk
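
As a rough illustration of what shard_size controls, the sketch below cuts a number of rows into consecutive ranges of at most shard_size rows. It assumes shards are simple consecutive row ranges; the helper name shard_bounds is hypothetical and is not part of DeepChem.

```python
def shard_bounds(n_rows, shard_size=2000):
    """Return (start, stop) row ranges, each covering at most shard_size rows."""
    return [
        (start, min(start + shard_size, n_rows))
        for start in range(0, n_rows, shard_size)
    ]


# 1500 rows fit in a single shard at the default size.
print(shard_bounds(1500))
# 5000 rows need three shards, the last one partially filled.
print(shard_bounds(5000))
```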

Freesolv Dataset¶

load_freesolv(featurizer: Featurizer | str = MATFeaturizer[], splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]¶

Load Freesolv dataset

The FreeSolv dataset is a collection of experimental and calculated hydration free energies for small molecules in water. Here, we use a modified version of the dataset that contains each molecule's SMILES string and the corresponding experimental hydration free energy.

Random splitting is recommended for this dataset.
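
A random split simply shuffles the row indices and cuts them into three parts. The sketch below shows the idea in plain Python; the 80/10/10 fractions, the seed, and the helper name random_split are assumptions made for illustration, not values taken from this loader.

```python
import random


def random_split(n, frac_train=0.8, frac_valid=0.1, seed=0):
    """Shuffle indices 0..n-1 and cut them into train/valid/test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = int(frac_train * n)
    n_valid = int(frac_valid * n)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test


train, valid, test = random_split(100)
print(len(train), len(valid), len(test))
```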

The raw data CSV file contains the columns below:

“mol” - SMILES representation of the molecular structure
“y” - Experimental hydration free energy

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

HIV Datasets¶

load_hiv(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['balancing'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]¶

Load HIV dataset

The HIV dataset was introduced by the Developmental Therapeutics Program (DTP) AIDS Antiviral Screen, which tested the ability of over 40,000 compounds to inhibit HIV replication. Screening results were evaluated and placed into three categories: confirmed inactive (CI), confirmed active (CA), and confirmed moderately active (CM). The latter two labels are combined, making it a binary classification task between inactive (CI) and active (CA and CM).

Scaffold splitting is recommended for this dataset.

The raw data CSV file contains the columns below:

“smiles”: SMILES representation of the molecular structure
“activity”: Three-class labels for screening results: CI/CM/CA
“HIV_active”: Binary labels for screening results: 1 (CA/CM) and 0 (CI)
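
The relationship between the “activity” and “HIV_active” columns described above can be written as a small lookup. This is only a sketch of the mapping; the helper name hiv_active and the sample values are made up for illustration.

```python
# CA and CM are merged into the active class; CI is the inactive class.
ACTIVITY_TO_BINARY = {"CI": 0, "CM": 1, "CA": 1}


def hiv_active(activity):
    """Convert a three-class screening label to the binary label."""
    return ACTIVITY_TO_BINARY[activity]


# Made-up sequence of screening results.
activities = ["CI", "CA", "CM", "CI"]
binary = [hiv_active(a) for a in activities]
print(binary)  # [0, 1, 1, 0]
```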

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

HOPV Datasets¶

HOPV stands for the Harvard Organic Photovoltaic Dataset.

load_hopv(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]¶

Load HOPV datasets. Does not do train/test split

The HOPV datasets consist of the Harvard Organic Photovoltaic Dataset. This dataset includes 350 small molecules and polymers that were utilized as p-type materials in OPVs. Experimental properties include: HOMO [a.u.], LUMO [a.u.], Electrochemical gap [a.u.], Optical gap [a.u.], Power conversion efficiency [%], Open circuit potential [V], Short circuit current density [mA/cm^2], and fill factor [%]. Theoretical calculations in the original dataset have been removed (for now).

Lopez, Steven A., et al. “The Harvard organic photovoltaic dataset.” Scientific data 3.1 (2016): 1-7.

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

HPPB Datasets¶

load_hppb(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['log'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]¶

Loads the thermodynamic solubility datasets.

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
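
The default transformer for this loader is 'log'. The sketch below assumes, for illustration, a label transform of the form log(y + 1) with exp(z) - 1 as its inverse; that exact form is an assumption here, and the helper names and label values are made up.

```python
import math


def log_transform(labels):
    """Apply log(y + 1) to each label (assumed form of the transform)."""
    return [math.log1p(value) for value in labels]


def inverse_log_transform(values):
    """Undo log(y + 1) by applying exp(z) - 1."""
    return [math.expm1(value) for value in values]


# Made-up non-negative labels spanning two orders of magnitude.
labels = [0.0, 9.0, 99.0]
transformed = log_transform(labels)
restored = inverse_log_transform(transformed)
print(transformed)
```

A transform like this compresses labels that span several orders of magnitude, which is why the inverse is needed before predictions are compared with measured values.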

KAGGLE Datasets¶

load_kaggle(shard_size=2000, featurizer=None, split=None, reload=True)[source]¶

Loads Kaggle datasets. Generates them if not already stored.

The Kaggle dataset is an in-house dataset from Merck that was first introduced in the following paper:

Ma, Junshui, et al. “Deep neural nets as a method for quantitative structure–activity relationships.” Journal of chemical information and modeling 55.2 (2015): 263-274.

It contains 100,000 unique Merck in-house compounds that were measured on 15 enzyme inhibition and ADME/TOX datasets. Unlike most of the other datasets featured in MoleculeNet, the Kaggle collection does not have structures for the compounds tested since they were proprietary Merck compounds. However, the collection does feature pre-computed descriptors for these compounds.

Note that the original train/valid/test split from the source data was preserved here, so this function doesn’t allow for alternate modes of splitting. Similarly, since the source data came pre-featurized, it is not possible to apply alternative featurizations.

Parameters:

shard_size (int, optional) – Size of the DiskDataset shards to write on disk
featurizer (optional) – Ignored since featurization pre-computed
split (optional) – Ignored since split pre-computed
reload (bool, optional) – Whether to automatically re-load from disk

Kinase Datasets¶

load_kinase(shard_size=2000, featurizer=None, split=None, reload=True)[source]¶

Loads Kinase datasets, does not do train/test split

The Kinase dataset is an in-house dataset from Merck that was first introduced in the following paper: Ramsundar, Bharath, et al. “Is multitask deep learning practical for pharma?.” Journal of chemical information and modeling 57.8 (2017): 2068-2076.

It contains 2500 Merck in-house compounds that were measured for IC50 of inhibition on 99 protein kinases. Unlike most of the other datasets featured in MoleculeNet, the Kinase collection does not have structures for the compounds tested since they were proprietary Merck compounds. However, the collection does feature pre-computed descriptors for these compounds.

Note that the original train/valid/test split from the source data was preserved here, so this function doesn’t allow for alternate modes of splitting. Similarly, since the source data came pre-featurized, it is not possible to apply alternative featurizations.

Parameters:

shard_size (int, optional) – Size of the DiskDataset shards to write on disk
featurizer (optional) – Ignored since featurization pre-computed
split (optional) – Ignored since split pre-computed
reload (bool, optional) – Whether to automatically re-load from disk

Lipo Datasets¶

load_lipo(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]¶

Load Lipophilicity dataset

Lipophilicity is an important feature of drug molecules that affects both membrane permeability and solubility. The lipophilicity dataset, curated from the ChEMBL database, provides experimental results for the octanol/water distribution coefficient (logD at pH 7.4) of 4200 compounds.

Scaffold splitting is recommended for this dataset.
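
A scaffold split keeps molecules that share a core structure in the same subset, so the test set contains structural families the model has not seen. The sketch below shows one way to do this once a scaffold key is known for every molecule: group by scaffold, then assign the largest groups to training first. The scaffold names, the 80/10/10 fractions, and the helper name scaffold_split are illustrative assumptions; in practice the key would be computed from each molecule's structure.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split indices so that each scaffold group stays in a single subset."""
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    # Largest groups first; ties are broken by first occurrence.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cutoff = frac_train * n
    valid_cutoff = (frac_train + frac_valid) * n
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Hypothetical scaffold keys for ten molecules.
scaffolds = ["benzene"] * 5 + ["pyridine"] * 3 + ["indole", "furan"]
train, valid, test = scaffold_split(scaffolds)
print(train, valid, test)
```

In this toy input the two large families fill the training set, while the two singleton scaffolds end up in validation and test.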

The raw data CSV file contains the columns below:

“smiles” - SMILES representation of the molecular structure
“exp” - Measured octanol/water distribution coefficient (logD) of the compound, used as label

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

Materials Datasets¶

Materials datasets include inorganic crystal structures, chemical compositions, and target properties like formation energies and band gaps. Machine learning problems in materials science commonly include predicting the value of a continuous (regression) or categorical (classification) property of a material based on its chemical composition or crystal structure. “Inverse design” is also of great interest, in which ML methods generate crystal structures that have a desired property. Other areas where ML is applicable in materials include discovering new or modified phenomenological models that describe material behavior.

load_bandgap(featurizer: Featurizer | str = ElementPropertyFingerprint[data_source='matminer'], splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]¶

Load band gap dataset.

Contains 4604 experimentally measured band gaps for inorganic crystal structure compositions. In benchmark studies, random forest models achieved a mean absolute error of 0.45 eV during five-fold nested cross validation on this dataset.

For more details on the dataset see [1]. For more details on previous benchmarks for this dataset, see [2].
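
The benchmark figure quoted above is an error averaged over predictions, usually abbreviated MAE (mean absolute error). A minimal sketch of how such a number is computed follows; the band gap values are made up and the helper name is only for illustration.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up band gaps in eV, not taken from the dataset.
measured = [1.0, 2.5, 0.0, 3.2]
predicted = [1.5, 2.0, 0.5, 3.0]
mae = mean_absolute_error(measured, predicted)
print(round(mae, 3))
```

The result is in the same units as the target, which is why the benchmark values are reported in eV.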

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

Returns:

tasks, datasets, transformers –

tasks (list) – Column names corresponding to machine learning target variables.
datasets (tuple) – train, validation, test splits of data as deepchem.data.datasets.Dataset instances.
transformers (list) – deepchem.trans.transformers.Transformer instances applied to dataset.

Return type:

tuple

References

Examples

>>> import deepchem as dc
>>> tasks, datasets, transformers = dc.molnet.load_bandgap()
>>> train_dataset, val_dataset, test_dataset = datasets
>>> n_tasks = len(tasks)
>>> n_features = train_dataset.get_data_shape()[0]
>>> model = dc.models.MultitaskRegressor(n_tasks, n_features)

load_perovskite(featurizer: Featurizer | str = DummyFeaturizer[], splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]¶

Load perovskite dataset.

Contains 18928 perovskite structures and their formation energies. In benchmark studies, random forest models and crystal graph neural networks achieved mean absolute errors of 0.23 and 0.05 eV/atom, respectively, during five-fold nested cross validation on this dataset.

For more details on the dataset see [1]. For more details on previous benchmarks for this dataset, see [2].

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

Returns:

tasks, datasets, transformers –

tasks (list) – Column names corresponding to machine learning target variables.
datasets (tuple) – train, validation, test splits of data as deepchem.data.datasets.Dataset instances.
transformers (list) – deepchem.trans.transformers.Transformer instances applied to dataset.

Return type:

tuple

References

Examples

>>> import deepchem as dc
>>> tasks, datasets, transformers = dc.molnet.load_perovskite()
>>> train_dataset, val_dataset, test_dataset = datasets
>>> model = dc.models.CGCNNModel(mode='regression', batch_size=32, learning_rate=0.001)

load_mp_formation_energy(featurizer: Featurizer | str = SineCoulombMatrix[max_atoms=100, flatten=True], splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]¶

Load mp formation energy dataset.

Contains 132752 calculated formation energies and inorganic crystal structures from the Materials Project database. In benchmark studies, random forest models achieved a mean absolute error of 0.116 eV/atom during five-fold nested cross validation on this dataset.

For more details on the dataset see [1]. For more details on previous benchmarks for this dataset, see [2].

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

Returns:

tasks, datasets, transformers –

tasks (list) – Column names corresponding to machine learning target variables.
datasets (tuple) – train, validation, test splits of data as deepchem.data.datasets.Dataset instances.
transformers (list) – deepchem.trans.transformers.Transformer instances applied to dataset.

Return type:

tuple

References

Examples

>>> import deepchem as dc
>>> tasks, datasets, transformers = dc.molnet.load_mp_formation_energy()
>>> train_dataset, val_dataset, test_dataset = datasets
>>> n_tasks = len(tasks)
>>> n_features = train_dataset.get_data_shape()[0]
>>> model = dc.models.MultitaskRegressor(n_tasks, n_features)

load_mp_metallicity(featurizer: Featurizer | str = SineCoulombMatrix[max_atoms=100, flatten=True], splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = ['balancing'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]¶

Load mp metallicity dataset.

Contains 106113 inorganic crystal structures from the Materials Project database labeled as metals or nonmetals. In benchmark studies, random forest models achieved a mean ROC-AUC of 0.9 during five-fold nested cross validation on this dataset.

For more details on the dataset see [1]. For more details on previous benchmarks for this dataset, see [2].
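
ROC-AUC, the metric quoted for this classification benchmark, can be read as the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The pure-Python sketch below computes it directly from that definition; the labels and scores are made up, and the quadratic pairwise loop is meant for clarity rather than speed.

```python
import math


def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 1 = metal, 0 = nonmetal, with hypothetical model scores.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.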

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

Returns:

tasks, datasets, transformers –

tasks (list) – Column names corresponding to machine learning target variables.
datasets (tuple) – train, validation, test splits of data as deepchem.data.datasets.Dataset instances.
transformers (list) – deepchem.trans.transformers.Transformer instances applied to dataset.

Return type:

tuple

References

Examples

>>> import deepchem as dc
>>> tasks, datasets, transformers = dc.molnet.load_mp_metallicity()
>>> train_dataset, val_dataset, test_dataset = datasets
>>> n_tasks = len(tasks)
>>> n_features = train_dataset.get_data_shape()[0]
>>> model = dc.models.MultitaskClassifier(n_tasks, n_features)

MUV Datasets¶

load_muv(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['balancing'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]¶

Load MUV dataset

The Maximum Unbiased Validation (MUV) group is a benchmark dataset selected from PubChem BioAssay by applying a refined nearest neighbor analysis.

The MUV dataset contains 17 challenging tasks for around 90 thousand compounds and is specifically designed for validation of virtual screening techniques.

Scaffold splitting is recommended for this dataset.

The raw data CSV file contains the columns below:

“mol_id” - PubChem CID of the compound
“smiles” - SMILES representation of the molecular structure
“MUV-XXX” - Measured results (Active/Inactive) for bioassays
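
Given the column layout above, the task columns can be picked out by their “MUV-” prefix. The sketch below does this with the standard csv module on a tiny in-memory excerpt. The rows, the task names MUV-001 and MUV-002, and the convention that an empty cell means "not measured" are all assumptions made for illustration.

```python
import csv
import io

# Made-up excerpt: IDs, SMILES, task names, and labels are illustrative only.
raw = """mol_id,smiles,MUV-001,MUV-002
CID0000001,CCO,0,
CID0000002,c1ccccc1,,1
CID0000003,CC(=O)O,0,0
"""

reader = csv.DictReader(io.StringIO(raw))
# Every column whose name starts with "MUV-" is treated as one task.
task_columns = [name for name in reader.fieldnames if name.startswith("MUV-")]
rows = list(reader)

# Count how many compounds carry a label for each task.
measured = {
    task: sum(1 for row in rows if row[task] != "")
    for task in task_columns
}
print(task_columns, measured)
```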

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

NCI Datasets¶

load_nci(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]¶

Load NCI dataset.

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

PCBA Datasets¶

load_pcba(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['balancing'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]¶

Load PCBA dataset

PubChem BioAssay (PCBA) is a database consisting of biological activities of small molecules generated by high-throughput screening. We use a subset of PCBA, containing 128 bioassays measured over 400 thousand compounds, used by previous work to benchmark machine learning methods.

Random splitting is recommended for this dataset.

The raw data CSV file contains the columns below:

“mol_id” - PubChem CID of the compound
“smiles” - SMILES representation of the molecular structure
“PCBA-XXX” - Measured results (Active/Inactive) for bioassays; search for the assay ID at https://pubchem.ncbi.nlm.nih.gov/search/#collection=bioassays for details

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

PDBBIND Datasets¶

load_pdbbind(featurizer: ComplexFeaturizer, splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, pocket: bool = True, set_name: str = 'core', **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]¶

Load PDBBind dataset.

The PDBBind dataset includes experimental binding affinity data and structures for 4852 protein-ligand complexes from the “refined set” and 12800 complexes from the “general set” in PDBBind v2019 and 193 complexes from the “core set” in PDBBind v2013. The refined set removes data with obvious problems in 3D structure, binding data, or other aspects and should therefore be a better starting point for docking/scoring studies. Details on the criteria used to construct the refined set can be found in [4]. The general set does not include the refined set. The core set is a subset of the refined set that is not updated annually.

Random splitting is recommended for this dataset.

The raw dataset contains the columns below:

“ligand” - SDF of the molecular structure
“protein” - PDB of the protein structure
“CT_TOX” - Clinical trial results

Parameters:

featurizer (ComplexFeaturizer or str) – the complex featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
pocket (bool (default True)) – If true, use only the binding pocket for featurization.
set_name (str (default 'core')) – Name of dataset to download. ‘refined’, ‘general’, and ‘core’ are supported.

Returns:

tasks, datasets, transformers –

tasks: list: Column names corresponding to machine learning target variables.
datasets: tuple: train, validation, test splits of data as deepchem.data.datasets.Dataset instances.
transformers: list: deepchem.trans.transformers.Transformer instances applied to dataset.

Return type:

tuple

References

PPB Datasets¶

load_ppb(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]¶

Load PPB datasets.

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

QM7 Datasets¶

load_qm7(featurizer: Featurizer | str = CoulombMatrix[max_atoms=23, remove_hydrogens=False, randomize=False, upper_tri=False, n_samples=1, seed=None], splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]¶

Load QM7 dataset

QM7 is a subset of GDB-13 (a database of nearly 1 billion stable and synthetically accessible organic molecules) containing up to 7 heavy atoms C, N, O, and S. The 3D Cartesian coordinates of the most stable conformations and their atomization energies were determined using ab-initio density functional theory (PBE0/tier2 basis set). This dataset also provides Coulomb matrices as calculated in [Rupp et al. PRL, 2012].
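In the Coulomb matrix of Rupp et al., the diagonal entries are 0.5 * Z_i^2.4 and the off-diagonal entries are Z_i * Z_j / |R_i - R_j|, where Z_i is the nuclear charge and R_i the Cartesian position of atom i. The following is a minimal pure-Python sketch of that definition, assuming coordinates in Bohr; the function name is illustrative and is not part of DeepChem, and it does not reproduce the padding to 23 x 23 used in the stored matrices.

```python
import math


def coulomb_matrix(charges, coords):
    """Return the Coulomb matrix as a list of lists.

    charges: nuclear charges Z_i.
    coords: (x, y, z) tuples, assumed to be in Bohr.
    """
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal term: 0.5 * Z_i ** 2.4
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # Off-diagonal term: Z_i * Z_j / |R_i - R_j|
                distance = math.dist(coords[i], coords[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


# Two hydrogen atoms placed 1.4 Bohr apart (illustrative geometry).
h2 = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)])
```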

Stratified splitting is recommended for this dataset.

The data file is in .mat format (Python users can load the original data with scipy.io.loadmat) and contains five arrays:

“X” - (7165 x 23 x 23), Coulomb matrices
“T” - (7165), atomization energies (unit: kcal/mol)
“P” - (5 x 1433), cross-validation splits as used in [Montavon et al. NIPS, 2012]
“Z” - (7165 x 23), atomic charges
“R” - (7165 x 23 x 3), Cartesian coordinates (unit: Bohr) of each atom in the molecules
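
The shape of “P” (5 x 1433 = 7165) is consistent with each molecule appearing in exactly one of five folds. The sketch below shows how such an array could be turned into train/test index lists, assuming each row lists the sample indices of one fold. That reading of the layout is an assumption, and the tiny array is made up for illustration.

```python
def fold_split(folds, test_fold):
    """Split sample indices into (train, test), holding out one fold."""
    test = list(folds[test_fold])
    train = [
        idx
        for k, fold in enumerate(folds)
        if k != test_fold
        for idx in fold
    ]
    return train, test


# Toy stand-in for "P": 5 folds of 2 indices each (the real array is 5 x 1433).
toy_p = [[0, 5], [1, 6], [2, 7], [3, 8], [4, 9]]
train_idx, test_idx = fold_split(toy_p, test_fold=0)
```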

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

Note

DeepChem 2.4.0 has turned on sanitization for this dataset by default. For the QM7 dataset, this means that calling this function will return 6838 compounds instead of 7160 in the source dataset file. This appears to be due to valence specification mismatches in the dataset that weren’t caught in earlier more lax versions of RDKit. Note that this may subtly affect benchmarking results on this dataset.

The URL for the QM7 dataset has been updated to GDB7_V2_URL because the SDF file from GDB7_URL contains 4 additional molecules (containing 1 or 2 hydrogen atoms) that were not part of the original QM7 dataset, which had only 7165 molecules.

References

QM8 Datasets¶

load_qm8(featurizer: Featurizer | str = CoulombMatrix[max_atoms=26, remove_hydrogens=False, randomize=False, upper_tri=False, n_samples=1, seed=None], splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]¶

Load QM8 dataset

QM8 is the dataset used in a study on modeling quantum mechanical calculations of electronic spectra and excited state energy of small molecules. Multiple methods, including time-dependent density functional theories (TDDFT) and second-order approximate coupled-cluster (CC2), are applied to a collection of molecules that include up to eight heavy atoms (also a subset of the GDB-17 database). In our collection, there are four excited state properties calculated by four different methods on 22 thousand samples:

S0 -> S1 transition energy E1 and the corresponding oscillator strength f1

S0 -> S2 transition energy E2 and the corresponding oscillator strength f2

E1, E2, f1, and f2 are in atomic units. f1 and f2 are in length representation.
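
Because the transition energies are given in atomic units (Hartree), it is often convenient to convert them to electronvolts or to an absorption wavelength. The sketch below uses the standard conversion factors; the function names and the sample energy are illustrative and do not come from the dataset.

```python
HARTREE_TO_EV = 27.211386  # 1 Hartree expressed in electronvolts
EV_NM = 1239.841984        # h * c expressed in eV * nm


def hartree_to_ev(energy_hartree):
    """Convert an energy from Hartree to electronvolts."""
    return energy_hartree * HARTREE_TO_EV


def transition_wavelength_nm(energy_hartree):
    """Wavelength (nm) of a photon matching the transition energy."""
    return EV_NM / hartree_to_ev(energy_hartree)


# Illustrative transition energy of 0.25 Hartree (not a QM8 entry).
e1_ev = hartree_to_ev(0.25)
e1_nm = transition_wavelength_nm(0.25)
```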

Random splitting is recommended for this dataset.

The source data contain:

qm8.sdf: molecular structures
qm8.sdf.csv: tables for molecular properties
Column 1: Molecule ID (gdb9 index) mapping to the .sdf file
Columns 2-5: RI-CC2/def2TZVP
Columns 6-9: LR-TDPBE0/def2SVP
Columns 10-13: LR-TDPBE0/def2TZVP
Columns 14-17: LR-TDCAM-B3LYP/def2TZVP

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

Note

DeepChem 2.4.0 has turned on sanitization for this dataset by default. For the QM8 dataset, this means that calling this function will return 21747 compounds instead of 21786 in the source dataset file. This appears to be due to valence specification mismatches in the dataset that weren’t caught in earlier more lax versions of RDKit. Note that this may subtly affect benchmarking results on this dataset.

References

QM9 Datasets¶

A bug was reported in https://github.com/deepchem/deepchem/issues/4413 concerning the previously included SDF files for the QM9 dataset: some molecules incorrectly carried formal charges, even though QM9 molecules are charge-neutral. To address this, we now use the original QM9 XYZ files and convert them to SDF format using Open Babel, which preserves correct charge information. The updated SDF file is uploaded to the DeepChem S3 bucket.

Note:

1. Molecules such as gdb 24 previously contained incorrect formal charges on nitrogen atoms. This issue has been resolved in the latest SDF files by re-parsing the original XYZ files using the script deepchem/examples/qm9/qm9_data_preprocessing.py.

2. However, some molecules (e.g., gdb 21968) exhibit a discrepancy depending on the sanitization setting. When parsed with RDKit using sanitize=False, no atom is assigned a formal charge. In contrast, parsing the same molecule with sanitize=True assigns formal charges to nitrogen and oxygen atoms, even though the molecule is neutral overall.

load_qm9(featurizer: Featurizer | str = CoulombMatrix[max_atoms=29, remove_hydrogens=False, randomize=False, upper_tri=False, n_samples=1, seed=None], splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]¶

Load QM9 dataset

QM9 is a comprehensive dataset that provides geometric, energetic, electronic and thermodynamic properties for a subset of GDB-17 database, comprising 134 thousand stable organic molecules with up to 9 heavy atoms. All molecules are modeled using density functional theory (B3LYP/6-31G(2df,p) based DFT).

Random splitting is recommended for this dataset.

The source data contain:

qm9.sdf: molecular structures
qm9.sdf.csv: tables for molecular properties
“mol_id” - Molecule ID (gdb9 index) mapping to the .sdf file
“A” - Rotational constant (unit: GHz)
“B” - Rotational constant (unit: GHz)
“C” - Rotational constant (unit: GHz)
“mu” - Dipole moment (unit: D)
“alpha” - Isotropic polarizability (unit: Bohr^3)
“homo” - Highest occupied molecular orbital energy (unit: Hartree)
“lumo” - Lowest unoccupied molecular orbital energy (unit: Hartree)
“gap” - Gap between HOMO and LUMO (unit: Hartree)
“r2” - Electronic spatial extent (unit: Bohr^2)
“zpve” - Zero point vibrational energy (unit: Hartree)
“u0” - Internal energy at 0K (unit: Hartree)
“u298” - Internal energy at 298.15K (unit: Hartree)
“h298” - Enthalpy at 298.15K (unit: Hartree)
“g298” - Free energy at 298.15K (unit: Hartree)
“cv” - Heat capacity at 298.15K (unit: cal/(mol*K))
“u0_atom” - Atomization energy at 0K (unit: kcal/mol)
“u298_atom” - Atomization energy at 298.15K (unit: kcal/mol)
“h298_atom” - Atomization enthalpy at 298.15K (unit: kcal/mol)
“g298_atom” - Atomization free energy at 298.15K (unit: kcal/mol)

“u0_atom” ~ “g298_atom” (used in MoleculeNet) are calculated from the differences between “u0” ~ “g298” and sum of reference energies of all atoms in the molecules, as given in https://figshare.com/articles/Atomref%3A_Reference_thermochemical_energies_of_H%2C_C%2C_N%2C_O%2C_F_atoms./1057643
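
The calculation described above can be sketched as follows: subtract the summed atomic reference energies from the molecular energy, then convert from Hartree to kcal/mol. The function name is hypothetical, and the reference energies and molecular energy below are round illustrative numbers, not the published Atomref values.

```python
HARTREE_TO_KCAL_PER_MOL = 627.5095  # standard conversion factor


def atomization_energy(total_energy, atoms, atom_refs):
    """Molecular energy minus summed atomic references, in kcal/mol.

    total_energy: molecular energy in Hartree (e.g. the "u0" column).
    atoms: element symbols of every atom in the molecule.
    atom_refs: mapping from element symbol to reference energy in Hartree.
    """
    reference = sum(atom_refs[symbol] for symbol in atoms)
    return (total_energy - reference) * HARTREE_TO_KCAL_PER_MOL


# Illustrative round numbers only, NOT the published Atomref values.
toy_refs = {"H": -0.5, "C": -37.8}
toy_u0 = -40.4  # hypothetical "u0" for a CH4-like molecule, in Hartree
u0_atom = atomization_energy(toy_u0, ["C", "H", "H", "H", "H"], toy_refs)
```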

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

Note

DeepChem 2.4.0 has turned on sanitization for this dataset by default. For the QM9 dataset, this means that calling this function will return 132480 compounds instead of 133885 in the source dataset file. This appears to be due to valence specification mismatches in the dataset that weren’t caught in earlier more lax versions of RDKit. Note that this may subtly affect benchmarking results on this dataset.

References

SAMPL Datasets¶

load_sampl(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]¶

Load SAMPL(FreeSolv) dataset

The Free Solvation Database, FreeSolv(SAMPL), provides experimental and calculated hydration free energy of small molecules in water. The calculated values are derived from alchemical free energy calculations using molecular dynamics simulations. The experimental values are included in the benchmark collection.

Random splitting is recommended for this dataset.

The raw data csv file contains columns below:

“iupac” - IUPAC name of the compound
“smiles” - SMILES representation of the molecular structure
“expt” - Measured solvation energy (unit: kcal/mol) of the compound, used as label
“calc” - Calculated solvation energy (unit: kcal/mol) of the compound
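
As an illustration of the column layout, the sketch below reads rows with the csv module and compares the “calc” column against the “expt” label. The rows are made up for the example and are not real FreeSolv entries; the column order is likewise an assumption.

```python
import csv
import io


def mean_absolute_error(rows):
    """Mean |calc - expt| over the given rows, in kcal/mol."""
    errors = [abs(float(row["calc"]) - float(row["expt"])) for row in rows]
    return sum(errors) / len(errors)


# Made-up rows using the documented column names; not real data.
toy_csv = io.StringIO(
    "iupac,smiles,expt,calc\n"
    "compound-a,CCO,-5.0,-4.5\n"
    "compound-b,CCC,2.0,2.5\n"
)
rows = list(csv.DictReader(toy_csv))
mae = mean_absolute_error(rows)
```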

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

References

SIDER Datasets¶

load_sider(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['balancing'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]¶

Load SIDER dataset

The Side Effect Resource (SIDER) is a database of marketed drugs and adverse drug reactions (ADR). The version of the SIDER dataset in DeepChem has grouped drug side effects into 27 system organ classes following MedDRA classifications measured for 1427 approved drugs.

Random splitting is recommended for this dataset.

The raw data csv file contains columns below:

“smiles”: SMILES representation of the molecular structure
“Hepatobiliary disorders” ~ “Injury, poisoning and procedural complications”: Recorded side effects for the drug. Please refer to http://sideeffects.embl.de/se/?page=98 for details on ADRs.

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

References

Thermosol Datasets¶

load_thermosol(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = [], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]¶

Loads the thermodynamic solubility datasets.

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

Tox21 Datasets¶

load_tox21(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['balancing'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, tasks: List[str] = ['NR-AR', 'NR-AR-LBD', 'NR-AhR', 'NR-Aromatase', 'NR-ER', 'NR-ER-LBD', 'NR-PPAR-gamma', 'SR-ARE', 'SR-ATAD5', 'SR-HSE', 'SR-MMP', 'SR-p53'], **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]¶

Load Tox21 dataset

The “Toxicology in the 21st Century” (Tox21) initiative created a public database measuring toxicity of compounds, which has been used in the 2014 Tox21 Data Challenge. This dataset contains qualitative toxicity measurements for 8k compounds on 12 different targets, including nuclear receptors and stress response pathways.

Random splitting is recommended for this dataset.

The raw data csv file contains columns below:

“smiles” - SMILES representation of the molecular structure
“NR-XXX” - Nuclear receptor signaling bioassays results
“SR-XXX” - Stress response bioassays results

Please refer to https://tripod.nih.gov/tox21/challenge/data.jsp for details.

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
tasks (List[str], optional) – Specify the set of tasks to load. If no task is specified, then it loads the default set of tasks, which are NR-AR, NR-AR-LBD, NR-AhR, NR-Aromatase, NR-ER, NR-ER-LBD, NR-PPAR-gamma, SR-ARE, SR-ATAD5, SR-HSE, SR-MMP, and SR-p53.
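
A possible way to load only a subset of tasks is sketched below. The helper names select_tox21_tasks and load_tox21_subset are hypothetical; only dc.molnet.load_tox21 and its featurizer, splitter, and tasks arguments come from the signature above. The loader call itself needs DeepChem installed and, on first use, network access.

```python
TOX21_TASKS = [
    "NR-AR", "NR-AR-LBD", "NR-AhR", "NR-Aromatase", "NR-ER", "NR-ER-LBD",
    "NR-PPAR-gamma", "SR-ARE", "SR-ATAD5", "SR-HSE", "SR-MMP", "SR-p53",
]


def select_tox21_tasks(requested=None):
    """Return the requested tasks, or all 12 defaults when none are given."""
    if not requested:
        return list(TOX21_TASKS)
    unknown = [task for task in requested if task not in TOX21_TASKS]
    if unknown:
        raise ValueError(f"Unknown Tox21 tasks: {unknown}")
    return list(requested)


def load_tox21_subset(requested=None):
    """Call the loader with a validated task list."""
    # Imported lazily so the validation helper works without DeepChem.
    import deepchem as dc

    return dc.molnet.load_tox21(
        featurizer="ECFP",
        splitter="scaffold",
        tasks=select_tox21_tasks(requested),
    )


stress_tasks = select_tox21_tasks(["SR-ARE", "SR-p53"])
```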

References

Toxcast Datasets¶

load_toxcast(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['balancing'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]¶

Load Toxcast dataset

ToxCast is an extended data collection from the same initiative as Tox21, providing toxicology data for a large library of compounds based on in vitro high-throughput screening. The processed collection includes qualitative results of over 600 experiments on 8k compounds.

Random splitting is recommended for this dataset.

The raw data csv file contains columns below:

“smiles”: SMILES representation of the molecular structure
“ACEA_T47D_80hr_Negative” ~ “Tanguay_ZF_120hpf_YSE_up”: Bioassays results.
Please refer to the section “high-throughput assay information” at https://www.epa.gov/chemical-research/toxicity-forecaster-toxcasttm-data for details.

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in

References

USPTO Datasets¶

load_uspto(featurizer: Featurizer | str = 'RxnFeaturizer', splitter: Splitter | str | None = None, transformers: List[TransformerGenerator | str] = [], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, subset: str = 'MIT', sep_reagent: bool = True, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]¶

Load USPTO Datasets.

The USPTO dataset consists of over 1.8 million organic chemical reactions extracted from US patents and patent applications. The dataset contains the reactions in the form of reaction SMILES, which have the general format: reactant>reagent>product.

MolNet provides the ability to load subsets of the USPTO dataset, namely MIT, STEREO and 50K. The MIT dataset contains around 479K reactions, curated by Jin et al. The STEREO dataset contains around 1 million reactions; it has no duplicates, and the reactions include stereochemical information. The 50K dataset contains 50,000 reactions and is the benchmark for retrosynthesis predictions. The reactions are additionally classified into 10 reaction classes. The canonicalized version of the dataset used by the loader is the same as that used by Somnath et al.

The loader uses the SpecifiedSplitter to reproduce the splits specified by Schwaller et al. and Dai et al. Custom splitters can also be used. One toggle in the loader skips the source/target transformation needed for seq2seq tasks. Another toggle loads the dataset with the reagents and reactants either separated or mixed; mixing alters the entries in source by replacing the ‘>’ with ‘.’, effectively loading them as a unified SMILES string.
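
The reaction SMILES format and the separated/mixed toggle can be illustrated with a short sketch. This is not the loader's own code: the function names are hypothetical, and dropping empty parts when mixing is a choice made here so that no stray ‘.’ is left behind when a reaction has no reagents.

```python
def split_reaction(reaction_smiles):
    """Split 'reactant>reagent>product' into its three parts."""
    reactants, reagents, products = reaction_smiles.split(">")
    return reactants, reagents, products


def build_source(reaction_smiles, sep_reagent=True):
    """Return the source string with reagents separated ('>') or mixed ('.')."""
    reactants, reagents, _ = split_reaction(reaction_smiles)
    if sep_reagent:
        return f"{reactants}>{reagents}"
    # Mixed: join only the non-empty parts into one SMILES string.
    return ".".join(part for part in (reactants, reagents) if part)


# Esterification written as a reaction SMILES (illustrative example).
rxn = "CCO.CC(=O)O>OS(=O)(=O)O>CCOC(C)=O"
separated = build_source(rxn, sep_reagent=True)
mixed = build_source(rxn, sep_reagent=False)
```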

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
subset (str (default 'MIT')) – Subset of dataset to download. ‘FULL’, ‘MIT’, ‘STEREO’, and ‘50K’ are supported.
sep_reagent (bool (default True)) – Toggle to load dataset with reactants and reagents either separated or mixed.
skip_transform (bool (default True)) – Toggle to skip the source/target transformation.

Returns:

tasks, datasets, transformers –

tasks: list: Column names corresponding to machine learning target variables.
datasets: tuple: train, validation, test splits of data as deepchem.data.datasets.Dataset instances.
transformers: list: deepchem.trans.transformers.Transformer instances applied to dataset.

Return type:

tuple

References

UV Datasets¶

load_uv(shard_size=2000, featurizer=None, split=None, reload=True)[source]¶

Load UV dataset; does not do train/test split

The UV dataset is an in-house dataset from Merck that was first introduced in the following paper: Ramsundar, Bharath, et al. “Is multitask deep learning practical for pharma?.” Journal of chemical information and modeling 57.8 (2017): 2068-2076.

The UV dataset tests 10,000 of Merck’s internal compounds on 190 absorption wavelengths between 210 and 400 nm. Unlike most of the other datasets featured in MoleculeNet, the UV collection does not have structures for the compounds tested since they were proprietary Merck compounds. However, the collection does feature pre-computed descriptors for these compounds.

Note that the original train/valid/test split from the source data was preserved here, so this function doesn’t allow for alternate modes of splitting. Similarly, since the source data came pre-featurized, it is not possible to apply alternative featurizations.

Parameters:

shard_size (int, optional) – Size of the DiskDataset shards to write on disk
featurizer (optional) – Ignored since featurization pre-computed
split (optional) – Ignored since split pre-computed
reload (bool, optional) – Whether to automatically re-load from disk

ZINC15 Datasets¶

load_zinc15(featurizer: Featurizer | str = 'OneHot', splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, dataset_size: str = '250K', dataset_dimension: str = '2D', tasks: List[str] = ['mwt', 'logp', 'reactive'], **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]¶

Load zinc15.

ZINC15 is a dataset of over 230 million purchasable compounds for virtual screening of small molecules to identify structures that are likely to bind to drug targets. ZINC15 data is currently available in 2D (SMILES string) format.

MolNet provides subsets of 250K, 1M, and 10M “lead-like” compounds from ZINC15. The full dataset of 270M “goldilocks” compounds is also available. Compounds in ZINC15 are labeled by their molecular weight and LogP (solubility) values. Each compound also has information about how readily available (purchasable) it is and its reactivity. Lead-like compounds have molecular weight between 300 and 350 Daltons and LogP between -1 and 3.5. Goldilocks compounds are lead-like compounds with LogP values further restricted to between 2 and 3.
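
The lead-like and goldilocks criteria above can be expressed as two small predicates. This is only a sketch: the function names are hypothetical, and treating the stated bounds as inclusive is an assumption.

```python
def is_lead_like(mol_weight, logp):
    """Molecular weight in [300, 350] Daltons and LogP in [-1, 3.5]."""
    return 300 <= mol_weight <= 350 and -1 <= logp <= 3.5


def is_goldilocks(mol_weight, logp):
    """Lead-like, with LogP further restricted to [2, 3]."""
    return is_lead_like(mol_weight, logp) and 2 <= logp <= 3


# Illustrative property values, not taken from ZINC15.
candidates = [(325.0, 2.5), (325.0, 0.0), (400.0, 2.5)]
labels = [(is_lead_like(mw, lp), is_goldilocks(mw, lp)) for mw, lp in candidates]
```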

If reload = True and data_dir (save_dir) is specified, the loader will attempt to load the raw dataset (featurized dataset) from disk. Otherwise, the dataset will be downloaded from the DeepChem AWS bucket.

For more information on ZINC15, please see [1]_ and https://zinc15.docking.org/.

Parameters:

featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
dataset_size (str (default '250K')) – Size of dataset to download. ‘250K’, ‘1M’, ‘10M’, and ‘270M’ are supported.
dataset_dimension (str (default '2D')) – Format of data to download. 2D SMILES strings or 3D SDF files.
tasks (List[str], optional (default ['mwt', 'logp', 'reactive'])) – Specify the set of tasks to load. If no task is specified, then it loads the default set of tasks, which are mwt, logp, and reactive.

Returns:

tasks, datasets, transformers –

tasks: list: Column names corresponding to machine learning target variables.
datasets: tuple: train, validation, test splits of data as deepchem.data.datasets.Dataset instances.
transformers: list: deepchem.trans.transformers.Transformer instances applied to dataset.

Return type:

tuple

Notes

The total ZINC dataset with SMILES strings contains hundreds of millions of compounds and is over 100GB! ZINC250K is recommended for experimentation. The full set of 270M goldilocks compounds is 23GB.

References

Platinum Adsorption Dataset¶

load_Platinum_Adsorption(featurizer: Featurizer | str = SineCoulombMatrix[max_atoms=100, flatten=True], splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = [], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) → Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]¶

Load Platinum Adsorption Dataset

The dataset consists of different configurations of adsorbates (i.e. N and NO) on a platinum surface, represented as a lattice, together with their formation energies. There are 648 different adsorbate configurations in this dataset, represented as Pymatgen Structure objects.

Each Pymatgen Structure object carries site_properties with the following keys:
- “SiteTypes”, indicating whether a site is an active site “A1” or a spectator site “S1”.
- “oss”, the different occupational sites. For spectator sites, set it to -1.

Parameters:

featurizer (Featurizer (default LCNNFeaturizer)) – the featurizer to use for processing the data. The LCNNFeaturizer is recommended.
splitter (Splitter (default RandomSplitter)) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data, chosen to suit the featurizer. No transformation is required for the LCNNFeaturizer.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str, optional (default None)) – a directory to save the dataset in

References

Examples

>>> import deepchem as dc
>>> # data_path and feat_args are placeholders that the caller must define
>>> tasks, datasets, transformers = dc.molnet.load_Platinum_Adsorption(
...    reload=True,
...    data_dir=data_path,
...    save_dir=data_path,
...    featurizer_kwargs=feat_args)
>>> train_dataset, val_dataset, test_dataset = datasets