MoleculeNet

The DeepChem library is packaged alongside the MoleculeNet suite of datasets. One of the most important parts of machine learning applications is finding a suitable dataset. The MoleculeNet suite has curated a whole range of datasets and loaded them into DeepChem dc.data.Dataset objects for convenience.

MoleculeNet Cheatsheet

When training a model or running a benchmark, the user needs specific datasets. For newcomers, however, the search can be exhausting and confusing. The following cheatsheet aims to help DeepChem users identify more easily which dataset suits their purpose.

Each row represents a dataset and gives a brief description of it. The columns give the type of data (molecules, images, or materials) and the number of data points. Each dataset is referenced with a link to its paper. Finally, some entries still need further information.

Cheatsheet

MoleculeNet description

Name | Description | Type | Data Points | Reference
BACE (Regression) | Provides binding results for a set of inhibitors of human beta-secretase (BACE-1) | Molecules | 1513 | ref
BACE (Classification) | Provides binding results for a set of inhibitors of human beta-secretase (BACE-1) | Molecules | 1513 | ref
BBBC (BBBC001) | Images of HT29 colon cancer cells | Images | 6 | ref
BBBC (BBBC002) | Images of Drosophila Kc167 cells | Images | 50 | ref
BBBC (BBBC004) | Synthetic images of clustered nuclei | | 20 | ref (https://data.broadinstitute.org/bbbc/BBBC004/)
BBBP | Blood-Brain Barrier Penetration designed for the modeling and prediction of barrier permeability | Binary labels on permeability properties | 2000 | ref
Cell Counting | Synthetic emulations of fluorescence microscopic images of bacterial cells | Images | 200 | ref
ChEMBL (set = ‘sparse’) | A sparse subset of ChEMBL with activity data for one target | Molecules | 244 245 | ref
ChEMBL (set = ‘5thresh’) | A subset of ChEMBL with activity data for at least five targets | Molecules | 23 871 | ref
ChEMBL25 | | Molecules | | ref
Clearance | | | | ref
Clintox | Compares drugs approved by the FDA and drugs that have failed clinical trials for toxicity reasons. | Molecules | 1491 | ref
Delaney | A regression dataset containing structures and water solubility data | Molecules | 1128 | ref
Factors | Merck in-house compounds that were measured for IC50 of inhibition on 12 serine proteases | Molecules | 1500 |
Freesolv | A collection of experimental and calculated hydration free energies for small molecules in water | Molecules | 643 | ref
HIV | A dataset which tested the ability to inhibit HIV replication | Molecules | 40 000 | ref
HOPV | Harvard Organic Photovoltaic dataset utilized as p-type materials | Molecules | 350 |
HPPB | Thermodynamic solubility datasets | | |
KAGGLE | In-house compounds that were measured on 15 enzyme inhibition and ADME/TOX datasets. | Molecules | 100 000 | ref
KINASE | In-house compounds that were measured for IC50 of inhibition on 99 protein kinases | Molecules | 2 500 |
LIPO | Experimental results of octanol/water distribution coefficient (logD at pH 7.4) | Molecules | 4 200 | ref
Band Gap | Experimentally measured band gaps for inorganic crystal structures | Materials | 4 604 | ref
Perovskite | Contains Perovskite structures and their formation energies | Materials | 18 928 | ref
MP Formation Energy | Contains calculated formation energies and inorganic crystal structures from the Materials Project database | Materials | 132 752 | ref
MP Metallicity | Contains inorganic crystal structures from the Materials Project database labeled as metals or nonmetals | Materials | 106 113 | ref
MUV | Benchmark dataset selected from PubChem BioAssay by applying a refined nearest neighbor analysis | Molecules | 90 000 | ref
NCI | | | |
PCBA | Database consisting of biological activities of small molecules generated by high-throughput screening | Molecules | 400 000 | ref
PDBBIND | Experimental binding affinity data and structures of protein-ligand complexes | Molecules | “refined set” 4 852 - “general set” 12 800 - “core set” 193 | ref
PPB | | | |
QM7 | Subset of GDB-13 containing up to 7 heavy atoms CNOS | Molecules | 7 165 | ref
QM8 | Dataset used in a study on modeling quantum mechanical calculations of electronic spectra and excited state energy of small molecules | Molecules | 20 000 | ref
QM9 | Dataset that provides geometric/energetic/electronic and thermodynamic properties for a subset of GDB-17 database | Molecules | 134 000 | ref
SAMPL | Similar to the FreeSolv dataset, which provides experimental and calculated hydration free energy of small molecules in water | | |
SIDER | The Side Effect Resource (SIDER) is a database of marketed drugs and adverse drug reactions (ADR) | Molecules | 1 427 | ref
Thermosol | Thermodynamic solubility datasets | | |
Tox21 | The “Toxicology in the 21st Century” (Tox21) initiative created a public database measuring the toxicity of compounds | Molecules | 8 000 | ref
Toxcast | Toxicology data for an extensive library of compounds based on in vitro high-throughput screening | Molecules | 8 000 | ref
USPTO | Subsets of USPTO dataset of organic chemical reactions extracted from US patents and patent applications | Chemical reactions SMILES | MIT 479 000 - STEREO 1 M - 50K 50 000 | ref
UV | The UV dataset tests Merck’s internal compounds on 190 absorption wavelengths between 210 and 400 nm | Molecules | 10 000 |
ZINC15 | Purchasable compounds for virtual screening of small molecules to identify structures that are likely to bind to drug targets | Molecules | 250K - 1M - 10M | ref
Platinum Adsorption | Different configurations of adsorbates (i.e. N and NO) on a platinum surface, represented as a lattice, and their formation energy | Adsorbate Configurations | 648 |

Contributing a new dataset to MoleculeNet

If you are proposing a new dataset to be included in the MoleculeNet benchmarking suite, please follow the instructions below. Please review the datasets already available in MolNet before contributing.

  1. Read the Contribution guidelines.

  2. Open an issue to discuss the dataset you want to add to MolNet.

  3. Write a DatasetLoader class that inherits from deepchem.molnet.load_function.molnet_loader._MolnetLoader and implements a create_dataset method. See the _QM9Loader for a simple example.

  4. Write a load_dataset function that documents the dataset and add your load function to deepchem.molnet.__init__.py for easy importing.

  5. Prepare your dataset as a .tar.gz or .zip file. Accepted filetypes include CSV, JSON, and SDF.

  6. Ask a member of the technical steering committee to add your .tar.gz or .zip file to the DeepChem AWS bucket. Modify your load function to pull down the dataset from AWS.

  7. Add documentation for your loader to the MoleculeNet docs.

  8. Submit a [WIP] PR (Work in progress pull request) following the PR template.

Example Usage

Below is an example of how to load a MoleculeNet dataset and featurizer. This approach will work for any dataset in MoleculeNet by changing the load function and featurizer. For more details on the featurizers, see the Featurizers section.

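import deepchem as dc
from deepchem.feat.molecule_featurizers import MolGraphConvFeaturizer

# Featurize each molecule as a graph, including edge (bond) features.
featurizer = MolGraphConvFeaturizer(use_edges=True)
# Every loader returns the task names, the split datasets, and the transformers.
dataset_dc = dc.molnet.load_qm9(featurizer=featurizer)
tasks, dataset, transformers = dataset_dc
train, valid, test = dataset

# Features, labels, sample weights, and identifiers of the training set.
x, y, w, ids = train.X, train.y, train.w, train.ids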

Note that the “w” matrix represents the weight of each sample. Some assays may have missing values, in which case the weight is 0. Otherwise, the weight is 1.
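
To make the role of the weights concrete, here is a minimal sketch in plain Python of a weighted mean squared error in which samples with weight 0 do not contribute. The function name weighted_mse and the toy numbers are illustrative assumptions, not part of DeepChem.

```python
def weighted_mse(y_true, y_pred, w):
    """Mean squared error in which each sample is scaled by its weight."""
    total_weight = sum(w)
    if total_weight == 0:
        raise ValueError("all weights are zero; nothing to average")
    squared_errors = [
        wi * (yt - yp) ** 2 for yt, yp, wi in zip(y_true, y_pred, w)
    ]
    return sum(squared_errors) / total_weight


# Toy values: the third label is missing, so its weight is 0 and the
# placeholder label (0.0) has no influence on the result.
y_true = [1.0, 2.0, 0.0]
y_pred = [1.5, 2.0, 9.0]
w = [1.0, 1.0, 0.0]
loss = weighted_mse(y_true, y_pred, w)
print(loss)
```

Only the two samples with weight 1 enter the average, so the large error on the missing entry is ignored.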

Additionally, the environment variable DEEPCHEM_DATA_DIR can be set, for example with os.environ['DEEPCHEM_DATA_DIR'] = 'path/to/store/featurized/dataset'. When this variable is set, the MolNet loader stores the featurized dataset in the specified directory. The next time the dataset is loaded, it is fetched from that directory directly rather than being featurized from the raw data again.
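
A minimal sketch of setting the variable from Python follows. It uses a temporary directory purely for illustration; in practice a persistent location would be the sensible choice. The variable should be set before any loader is called.

```python
import os
import tempfile

# Use a throwaway directory here; in practice, pick a persistent location.
data_dir = tempfile.mkdtemp(prefix="deepchem_data_")
os.environ["DEEPCHEM_DATA_DIR"] = data_dir

# A MolNet loader called later in the same process (for example
# dc.molnet.load_qm9) is then expected to keep its files in this directory.
print(os.environ["DEEPCHEM_DATA_DIR"])
```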

BACE Dataset

load_bace_classification(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['balancing'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load BACE dataset, classification labels

BACE dataset with classification labels (“class”).

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

load_bace_regression(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load BACE dataset, regression labels

The BACE dataset provides quantitative IC50 and qualitative (binary label) binding results for a set of inhibitors of human beta-secretase 1 (BACE-1).

All data are experimental values reported in scientific literature over the past decade, some with detailed crystal structures available. A collection of 1522 compounds is provided, along with the regression labels of IC50.

Scaffold splitting is recommended for this dataset.

The raw data CSV file contains the columns below:

  • “mol” - SMILES representation of the molecular structure

  • “pIC50” - Negative log of the IC50 binding affinity

  • “class” - Binary labels for inhibitor
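
As a worked illustration of the “pIC50” column, the sketch below converts between IC50 and pIC50. It assumes the usual convention that IC50 is expressed in mol/L and that the logarithm is base 10; the source does not state the units, and the helper names are hypothetical.

```python
import math


def ic50_to_pic50(ic50_molar):
    """Return pIC50 = -log10(IC50), with IC50 assumed to be in mol/L."""
    if ic50_molar <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_molar)


def pic50_to_ic50(pic50):
    """Invert the transformation: IC50 (mol/L) = 10 ** (-pIC50)."""
    return 10.0 ** (-pic50)


# Under the stated assumption, an IC50 of 1 nM (1e-9 mol/L) gives a pIC50 near 9.
print(ic50_to_pic50(1e-9))
```

Larger pIC50 values therefore correspond to more potent inhibitors.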

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

BBBC Datasets

load_bbbc001(splitter: Splitter | str | None = 'index', transformers: List[TransformerGenerator | str] = [], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load BBBC001 dataset

This dataset contains 6 images of human HT29 colon cancer cells. The task is to learn to predict the cell counts in these images. The dataset is too small to train algorithms on, but it may serve as a good test dataset. Full details are available at https://data.broadinstitute.org/bbbc/BBBC001/.

Parameters:
  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

load_bbbc002(splitter: Splitter | str | None = 'index', transformers: List[TransformerGenerator | str] = [], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load BBBC002 dataset

This dataset contains data corresponding to 5 samples of Drosophila Kc167 cells. There are 10 fields of view for each sample, each an image of size 512x512. Ground truth labels contain cell counts for this dataset. Full details about this dataset are available at https://data.broadinstitute.org/bbbc/BBBC002/.

Parameters:
  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

load_bbbc004(overlap_probability: float = 0.0, load_segmentation_mask: bool = False, splitter: Splitter | str | None = 'index', transformers: List[TransformerGenerator | str] = [], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load BBBC004 dataset

This dataset contains data corresponding to 20 samples of synthetically generated fluorescent cell population images. Each sample is an image of size 950x950 containing 300 cells. Ground truth labels contain cell counts and segmentation masks for this dataset. Full details about this dataset are available at https://data.broadinstitute.org/bbbc/BBBC004/.

Parameters:
  • overlap_probability (float from list {0.0, 0.15, 0.3, 0.45, 0.6}) – the overlap probability of the synthetic cells in the images

  • load_segmentation_mask (bool) – if True, the dataset will contain segmentation masks as labels. Otherwise, the dataset will contain cell counts as labels.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

Examples

Importing necessary modules

>>> import deepchem as dc
>>> import numpy as np

We can load the BBBC004 dataset with 2 types of labels: segmentation masks and cell counts. We will first load the dataset with cell counts as labels.

>>> loader = dc.molnet.load_bbbc004(overlap_probability=0.0, load_segmentation_mask=False)
>>> tasks, dataset, transformers = loader
>>> train, val, test = dataset

The full dataset has 20 samples, each with 300 cells, and the images are of size 950x950. The labels are cell counts. The training split holds 16 of the 20 samples, which we can verify as follows:

>>> train.X.shape
(16, 950, 950)
>>> train.y.shape
(16,)

We will now load the dataset with segmentation masks as labels.

>>> loader = dc.molnet.load_bbbc004(overlap_probability=0.0, load_segmentation_mask=True)
>>> tasks, dataset, transformers = loader
>>> train, val, test = dataset

The dataset again has 20 samples, each with 300 cells, and the images are of size 950x950. The labels are now segmentation masks. The training split holds 16 of the 20 samples, which we can verify as follows:

>>> train.X.shape
(16, 950, 950)
>>> train.y.shape
(16, 950, 950, 3)

BBBP Datasets

BBBP stands for Blood-Brain Barrier Penetration.

load_bbbp(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['balancing'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load BBBP dataset

The blood-brain barrier penetration (BBBP) dataset is designed for the modeling and prediction of barrier permeability. As a membrane separating circulating blood and brain extracellular fluid, the blood-brain barrier blocks most drugs, hormones, and neurotransmitters. Penetration of the barrier is therefore a long-standing issue in the development of drugs targeting the central nervous system.

This dataset includes binary labels for over 2000 compounds on their permeability properties.

Scaffold splitting is recommended for this dataset.

The raw data CSV file contains the columns below:

  • “name” - Name of the compound

  • “smiles” - SMILES representation of the molecular structure

  • “p_np” - Binary labels for penetration/non-penetration

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

Cell Counting Datasets

load_cell_counting(splitter: Splitter | str | None = None, transformers: List[TransformerGenerator | str] = [], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load Cell Counting dataset.

Loads the cell counting dataset from http://www.robots.ox.ac.uk/~vgg/research/counting/index_org.html.

Parameters:
  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

Chembl Datasets

load_chembl(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['normalization'], set: str = '5thresh', reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load the ChEMBL dataset.

This dataset is based on release 22.1 of the data from https://www.ebi.ac.uk/chembl/. Two subsets of the data are available, depending on the “set” argument. “sparse” is a large dataset with 244,245 compounds. As the name suggests, the data is extremely sparse, with most compounds having activity data for only one target. “5thresh” is a much smaller set (23,871 compounds) that includes only compounds with activity data for at least five targets.

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • set (str) – the subset to load, either “sparse” or “5thresh”

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in
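
The choice between the two subsets can be sketched as a thin wrapper around load_chembl. The wrapper name load_chembl_subset and the size table are illustrative assumptions; the compound counts are the ones quoted in the description above, and only the documented set argument is passed to the loader.

```python
# Compound counts quoted in the dataset description.
CHEMBL_SUBSET_SIZES = {"sparse": 244_245, "5thresh": 23_871}


def load_chembl_subset(subset="5thresh", **loader_kwargs):
    """Validate the subset name, then delegate to dc.molnet.load_chembl."""
    if subset not in CHEMBL_SUBSET_SIZES:
        valid = ", ".join(sorted(CHEMBL_SUBSET_SIZES))
        raise ValueError(f"unknown subset {subset!r}; expected one of: {valid}")
    # Imported lazily so the name check works even without DeepChem installed.
    import deepchem as dc

    return dc.molnet.load_chembl(set=subset, **loader_kwargs)
```

Calling load_chembl_subset("sparse") would then return the usual (tasks, datasets, transformers) tuple for the larger subset, downloading and featurizing the data on first use.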

Chembl25 Datasets

load_chembl25(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Loads the ChEMBL25 dataset, featurizes it, and does a split.

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

Clearance Datasets

load_clearance(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['log'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load clearance datasets.

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

Clintox Datasets

load_clintox(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['balancing'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load ClinTox dataset

The ClinTox dataset compares drugs approved by the FDA and drugs that have failed clinical trials for toxicity reasons. The dataset includes two classification tasks for 1491 drug compounds with known chemical structures:

  1. clinical trial toxicity (or absence of toxicity)

  2. FDA approval status.

The list of FDA-approved drugs is compiled from the SWEETLEAD database, and the list of drugs that failed clinical trials for toxicity reasons is compiled from the Aggregate Analysis of ClinicalTrials.gov (AACT) database.

Random splitting is recommended for this dataset.

The raw data CSV file contains the columns below:

  • “smiles” - SMILES representation of the molecular structure

  • “FDA_APPROVED” - FDA approval status

  • “CT_TOX” - Clinical trial results

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

Delaney Datasets

load_delaney(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load Delaney dataset

The Delaney (ESOL) dataset is a regression dataset containing structures and water solubility data for 1128 compounds. The dataset is widely used to validate machine learning models that estimate solubility directly from molecular structures (as encoded in SMILES strings).

Scaffold splitting is recommended for this dataset.

The raw data CSV file contains the columns below:

  • “Compound ID” - Name of the compound

  • “smiles” - SMILES representation of the molecular structure

  • “measured log solubility in mols per litre” - Log-scale water solubility of the compound, used as label
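
Because the label is on a log scale, a predicted value has to be exponentiated to recover a solubility in mol/L. The sketch below assumes a base-10 logarithm, which the column name does not state explicitly; the helper name is hypothetical.

```python
def log_solubility_to_molar(log_s):
    """Convert a log-scale solubility to mol/L, assuming a base-10 logarithm."""
    return 10.0 ** log_s


# Under this assumption, a label of -2.0 corresponds to about 0.01 mol/L.
print(log_solubility_to_molar(-2.0))
```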

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

Factors Datasets

load_factors(shard_size=2000, featurizer=None, split=None, reload=True)

Loads the Factors dataset; does not do a train/test split

The Factors dataset is an in-house dataset from Merck that was first introduced in the following paper: Ramsundar, Bharath, et al. “Is multitask deep learning practical for pharma?.” Journal of chemical information and modeling 57.8 (2017): 2068-2076.

It contains 1500 Merck in-house compounds that were measured for IC50 of inhibition on 12 serine proteases. Unlike most of the other datasets featured in MoleculeNet, the Factors collection does not have structures for the compounds tested since they were proprietary Merck compounds. However, the collection does feature pre-computed descriptors for these compounds.

Note that the original train/valid/test split from the source data was preserved here, so this function doesn’t allow for alternate modes of splitting. Similarly, since the source data came pre-featurized, it is not possible to apply alternative featurizations.

Parameters:
  • shard_size (int, optional) – Size of the DiskDataset shards to write on disk

  • featurizer (optional) – Ignored since featurization pre-computed

  • split (optional) – Ignored since split pre-computed

  • reload (bool, optional) – Whether to automatically re-load from disk

Freesolv Dataset

load_freesolv(featurizer: Featurizer | str = MATFeaturizer[], splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load Freesolv dataset

The FreeSolv dataset is a collection of experimental and calculated hydration free energies for small molecules in water. Here, we use a modified version of the dataset that contains the SMILES string of each molecule and the corresponding experimental hydration free energy.

Random splitting is recommended for this dataset.

The raw data CSV file contains the columns below:

  • “mol” - SMILES representation of the molecular structure

  • “y” - Experimental hydration free energy

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in
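As a minimal usage sketch of the parameters above: passing splitter=None asks the loader to return all the data in a single dataset. The helper name load_freesolv_unsplit is hypothetical, the choice of the 'ECFP' featurizer shortcut is an assumption, and the sketch assumes the unsplit data arrives as the only element of the returned datasets tuple. DeepChem is imported lazily, so defining the helper does not download anything.

```python
def load_freesolv_unsplit(featurizer="ECFP"):
    """Hypothetical helper: load FreeSolv without a train/valid/test split."""
    # Imported inside the function so that defining it does not require DeepChem.
    import deepchem as dc

    # splitter=None: all the data is included in a single dataset.
    tasks, datasets, transformers = dc.molnet.load_freesolv(
        featurizer=featurizer, splitter=None)

    # Assumption: the unsplit data is the only element of `datasets`.
    dataset = datasets[0]
    return tasks, dataset, transformers
```

Calling load_freesolv_unsplit() downloads the raw data on first use, which is why the call itself is left out of the sketch.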

References

HIV Datasets

load_hiv(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['balancing'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]

Load HIV dataset

The HIV dataset was introduced by the Drug Therapeutics Program (DTP) AIDS Antiviral Screen, which tested the ability to inhibit HIV replication for over 40,000 compounds. Screening results were evaluated and placed into three categories: confirmed inactive (CI), confirmed active (CA) and confirmed moderately active (CM). We further combine the latter two labels, making it a binary classification task between inactive (CI) and active (CA and CM).

Scaffold splitting is recommended for this dataset.

The raw data csv file contains columns below:

  • “smiles”: SMILES representation of the molecular structure

  • “activity”: Three-class labels for screening results: CI/CM/CA

  • “HIV_active”: Binary labels for screening results: 1 (CA/CM) and 0 (CI)
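The relationship between the “activity” and “HIV_active” columns can be sketched as below. The function name hiv_active_label is hypothetical and is not part of DeepChem; it only illustrates how the three-class label collapses into the binary one.

```python
def hiv_active_label(activity: str) -> int:
    """Map a three-class screening result (CI/CM/CA) to a binary label."""
    label = activity.strip().upper()
    if label == "CI":
        return 0  # confirmed inactive
    if label in ("CA", "CM"):
        return 1  # confirmed active or confirmed moderately active
    raise ValueError(f"unexpected activity label: {activity!r}")


binary = [hiv_active_label(a) for a in ["CI", "CM", "CA"]]
```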

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in
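The default 'balancing' transformer reweights examples so that rare actives are not swamped by inactives. The following is a rough sketch of that idea in plain Python, not DeepChem's actual implementation: each example receives a weight inversely proportional to the size of its class, so every class ends up with the same total weight. The name balancing_weights is hypothetical.

```python
from collections import Counter


def balancing_weights(labels):
    """Return one weight per example so that all classes have equal total weight."""
    counts = Counter(labels)
    largest = max(counts.values())
    # The largest class keeps weight 1.0; smaller classes are scaled up.
    return [largest / counts[y] for y in labels]


weights = balancing_weights([0, 0, 0, 1])
```

In this toy case the three inactive examples keep weight 1.0 and the single active example gets weight 3.0, so both classes sum to 3.0.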

References

HOPV Datasets

HOPV stands for the Harvard Organic Photovoltaic Dataset.

load_hopv(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]

Load HOPV datasets. Does not do train/test split

The HOPV datasets consist of the “Harvard Organic Photovoltaic Dataset”. This dataset includes 350 small molecules and polymers that were utilized as p-type materials in OPVs. Experimental properties include: HOMO [a.u.], LUMO [a.u.], Electrochemical gap [a.u.], Optical gap [a.u.], Power conversion efficiency [%], Open circuit potential [V], Short circuit current density [mA/cm^2], and fill factor [%]. Theoretical calculations in the original dataset have been removed (for now).

Lopez, Steven A., et al. “The Harvard organic photovoltaic dataset.” Scientific data 3.1 (2016): 1-7.

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

HPPB Datasets

load_hppb(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['log'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]

Loads the thermodynamic solubility datasets.

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

KAGGLE Datasets

load_kaggle(shard_size=2000, featurizer=None, split=None, reload=True)[source]

Loads kaggle datasets. Generates if not stored already.

The Kaggle dataset is an in-house dataset from Merck that was first introduced in the following paper:

Ma, Junshui, et al. “Deep neural nets as a method for quantitative structure–activity relationships.” Journal of chemical information and modeling 55.2 (2015): 263-274.

It contains 100,000 unique Merck in-house compounds that were measured on 15 enzyme inhibition and ADME/TOX datasets. Unlike most of the other datasets featured in MoleculeNet, the Kaggle collection does not have structures for the compounds tested since they were proprietary Merck compounds. However, the collection does feature pre-computed descriptors for these compounds.

Note that the original train/valid/test split from the source data was preserved here, so this function doesn’t allow for alternate modes of splitting. Similarly, since the source data came pre-featurized, it is not possible to apply alternative featurizations.

Parameters:
  • shard_size (int, optional) – Size of the DiskDataset shards to write on disk

  • featurizer (optional) – Ignored since featurization pre-computed

  • split (optional) – Ignored since split pre-computed

  • reload (bool, optional) – Whether to automatically re-load from disk
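To illustrate what shard_size controls, here is a small sketch of cutting a sequence of rows into fixed-size shards. It is only an illustration of the chunking idea; iter_shards is a hypothetical name and this is not how DiskDataset is implemented internally.

```python
def iter_shards(rows, shard_size=2000):
    """Yield consecutive slices of `rows`, each with at most `shard_size` items."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    for start in range(0, len(rows), shard_size):
        yield rows[start:start + shard_size]


# 4500 rows with the default shard size give two full shards and one partial shard.
shard_lengths = [len(shard) for shard in iter_shards(list(range(4500)))]
```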

Kinase Datasets

load_kinase(shard_size=2000, featurizer=None, split=None, reload=True)[source]

Loads Kinase datasets, does not do train/test split

The Kinase dataset is an in-house dataset from Merck that was first introduced in the following paper: Ramsundar, Bharath, et al. “Is multitask deep learning practical for pharma?.” Journal of chemical information and modeling 57.8 (2017): 2068-2076.

It contains 2500 Merck in-house compounds that were measured for IC50 of inhibition on 99 protein kinases. Unlike most of the other datasets featured in MoleculeNet, the Kinase collection does not have structures for the compounds tested since they were proprietary Merck compounds. However, the collection does feature pre-computed descriptors for these compounds.

Note that the original train/valid/test split from the source data was preserved here, so this function doesn’t allow for alternate modes of splitting. Similarly, since the source data came pre-featurized, it is not possible to apply alternative featurizations.

Parameters:
  • shard_size (int, optional) – Size of the DiskDataset shards to write on disk

  • featurizer (optional) – Ignored since featurization pre-computed

  • split (optional) – Ignored since split pre-computed

  • reload (bool, optional) – Whether to automatically re-load from disk

Lipo Datasets

load_lipo(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]

Load Lipophilicity dataset

Lipophilicity is an important feature of drug molecules that affects both membrane permeability and solubility. The lipophilicity dataset, curated from ChEMBL database, provides experimental results of octanol/water distribution coefficient (logD at pH 7.4) of 4200 compounds.

Scaffold splitting is recommended for this dataset.

The raw data csv file contains columns below:

  • “smiles” - SMILES representation of the molecular structure

  • “exp” - Measured octanol/water distribution coefficient (logD) of the compound, used as label

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in
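The default 'normalization' transformer rescales the labels to zero mean and unit standard deviation. A minimal sketch of that transformation and its inverse follows, assuming a single task and using hypothetical function names and hypothetical logD values. It illustrates the arithmetic only and is not DeepChem's implementation.

```python
from statistics import mean, pstdev


def fit_normalizer(y):
    """Return the mean and (population) standard deviation of the labels."""
    return mean(y), pstdev(y)


def normalize(y, y_mean, y_std):
    """Z-score the labels."""
    return [(v - y_mean) / y_std for v in y]


def denormalize(z, y_mean, y_std):
    """Map normalized predictions back to the original label scale."""
    return [v * y_std + y_mean for v in z]


logd = [1.0, 2.0, 3.0]  # hypothetical logD labels
y_mean, y_std = fit_normalizer(logd)
z = normalize(logd, y_mean, y_std)
restored = denormalize(z, y_mean, y_std)
```

The inverse step matters in practice: a model trained on normalized labels predicts in normalized units, so predictions need to be mapped back before being compared with measured values.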

References

Materials Datasets

Materials datasets include inorganic crystal structures, chemical compositions, and target properties like formation energies and band gaps. Machine learning problems in materials science commonly include predicting the value of a continuous (regression) or categorical (classification) property of a material based on its chemical composition or crystal structure. “Inverse design” is also of great interest, in which ML methods generate crystal structures that have a desired property. Another area where ML is applicable in materials is discovering new or modified phenomenological models that describe material behavior.

load_bandgap(featurizer: ~deepchem.feat.base_classes.Featurizer | str = ElementPropertyFingerprint[data_source='matminer'], splitter: ~deepchem.splits.splitters.Splitter | str | None = 'random', transformers: ~typing.List[~deepchem.molnet.load_function.molnet_loader.TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]

Load band gap dataset.

Contains 4604 experimentally measured band gaps for inorganic crystal structure compositions. In benchmark studies, random forest models achieved a mean absolute error of 0.45 eV during five-fold nested cross validation on this dataset.

For more details on the dataset see [1]. For more details on previous benchmarks for this dataset, see [2].
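The benchmark figure quoted above is an error averaged over predictions. As a reminder of how a mean absolute error (MAE) is computed, here is a small self-contained sketch; the band gap values are made up for illustration.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical band gaps in eV: errors are 0.5, 0.0 and 1.0.
mae = mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
```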

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

Returns:

tasks, datasets, transformers

tasks: list

Column names corresponding to machine learning target variables.

datasets: tuple

train, validation, test splits of data as deepchem.data.datasets.Dataset instances.

transformers: list

deepchem.trans.transformers.Transformer instances applied to dataset.

Return type:

tuple

References

Examples

>>> import deepchem as dc
>>> tasks, datasets, transformers = dc.molnet.load_bandgap()
>>> train_dataset, val_dataset, test_dataset = datasets
>>> n_tasks = len(tasks)
>>> n_features = train_dataset.get_data_shape()[0]
>>> model = dc.models.MultitaskRegressor(n_tasks, n_features)

load_perovskite(featurizer: ~deepchem.feat.base_classes.Featurizer | str = DummyFeaturizer[], splitter: ~deepchem.splits.splitters.Splitter | str | None = 'random', transformers: ~typing.List[~deepchem.molnet.load_function.molnet_loader.TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]

Load perovskite dataset.

Contains 18928 perovskite structures and their formation energies. In benchmark studies, random forest models and crystal graph neural networks achieved mean absolute errors of 0.23 and 0.05 eV/atom, respectively, during five-fold nested cross validation on this dataset.

For more details on the dataset see [1]. For more details on previous benchmarks for this dataset, see [2].
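The benchmarks above were evaluated with five-fold nested cross validation. The outer loop of such a scheme can be sketched as follows, assuming a simple shuffled partition; in the nested variant, the same procedure is repeated inside each training portion to select hyperparameters. The function name k_fold_indices and the seed are hypothetical choices, and this is not the benchmark authors' code.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(10, k=5))
```

Each sample appears in exactly one test fold, so every structure is used for evaluation once.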

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

Returns:

tasks, datasets, transformers

tasks: list

Column names corresponding to machine learning target variables.

datasets: tuple

train, validation, test splits of data as deepchem.data.datasets.Dataset instances.

transformers: list

deepchem.trans.transformers.Transformer instances applied to dataset.

Return type:

tuple

References

Examples

>>> import deepchem as dc
>>> tasks, datasets, transformers = dc.molnet.load_perovskite()
>>> train_dataset, val_dataset, test_dataset = datasets
>>> model = dc.models.CGCNNModel(mode='regression', batch_size=32, learning_rate=0.001)

load_mp_formation_energy(featurizer: ~deepchem.feat.base_classes.Featurizer | str = SineCoulombMatrix[max_atoms=100, flatten=True], splitter: ~deepchem.splits.splitters.Splitter | str | None = 'random', transformers: ~typing.List[~deepchem.molnet.load_function.molnet_loader.TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]

Load mp formation energy dataset.

Contains 132752 calculated formation energies and inorganic crystal structures from the Materials Project database. In benchmark studies, random forest models achieved a mean absolute error of 0.116 eV/atom during five-fold nested cross validation on this dataset.

For more details on the dataset see [1]. For more details on previous benchmarks for this dataset, see [2].

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

Returns:

tasks, datasets, transformers

tasks: list

Column names corresponding to machine learning target variables.

datasets: tuple

train, validation, test splits of data as deepchem.data.datasets.Dataset instances.

transformers: list

deepchem.trans.transformers.Transformer instances applied to dataset.

Return type:

tuple

References

Examples

>>> import deepchem as dc
>>> tasks, datasets, transformers = dc.molnet.load_mp_formation_energy()
>>> train_dataset, val_dataset, test_dataset = datasets
>>> n_tasks = len(tasks)
>>> n_features = train_dataset.get_data_shape()[0]
>>> model = dc.models.MultitaskRegressor(n_tasks, n_features)

load_mp_metallicity(featurizer: ~deepchem.feat.base_classes.Featurizer | str = SineCoulombMatrix[max_atoms=100, flatten=True], splitter: ~deepchem.splits.splitters.Splitter | str | None = 'random', transformers: ~typing.List[~deepchem.molnet.load_function.molnet_loader.TransformerGenerator | str] = ['balancing'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]

Load mp metallicity dataset.

Contains 106113 inorganic crystal structures from the Materials Project database labeled as metals or nonmetals. In benchmark studies, random forest models achieved a mean ROC-AUC of 0.9 during five-fold nested cross validation on this dataset.

For more details on the dataset see [1]. For more details on previous benchmarks for this dataset, see [2].
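Since this is a classification dataset, the benchmark is reported as ROC-AUC rather than an error in eV. One way to read that number is as the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A small pairwise sketch of that definition is shown below, with made-up labels and scores; real evaluations would typically rely on a library metric instead.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical metal (1) / nonmetal (0) labels with classifier scores.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Here three of the four positive/negative pairs are ordered correctly, giving an AUC of 0.75.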

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

Returns:

tasks, datasets, transformers

tasks: list

Column names corresponding to machine learning target variables.

datasets: tuple

train, validation, test splits of data as deepchem.data.datasets.Dataset instances.

transformers: list

deepchem.trans.transformers.Transformer instances applied to dataset.

Return type:

tuple

References

Examples

>>> import deepchem as dc
>>> tasks, datasets, transformers = dc.molnet.load_mp_metallicity()
>>> train_dataset, val_dataset, test_dataset = datasets
>>> n_tasks = len(tasks)
>>> n_features = train_dataset.get_data_shape()[0]
>>> model = dc.models.MultitaskClassifier(n_tasks, n_features)

MUV Datasets

load_muv(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['balancing'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]

Load MUV dataset

The Maximum Unbiased Validation (MUV) group is a benchmark dataset selected from PubChem BioAssay by applying a refined nearest neighbor analysis.

The MUV dataset contains 17 challenging tasks for around 90 thousand compounds and is specifically designed for validation of virtual screening techniques.

Scaffold splitting is recommended for this dataset.
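The idea behind scaffold splitting is that molecules sharing a scaffold are kept in the same subset, so the validation and test sets contain chemotypes the model has not seen during training. The sketch below assumes the scaffold of each molecule has already been computed (for example with a cheminformatics toolkit) and is given as a string key; the function name, the fractions, and the toy keys are assumptions for illustration, not DeepChem's API.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split indices so that each scaffold group lands in exactly one subset."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Place the largest scaffold groups first; break ties by first index.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cutoff = frac_train * n
    valid_cutoff = (frac_train + frac_valid) * n
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Ten molecules over four hypothetical scaffolds A-D.
keys = ["A"] * 6 + ["B"] * 2 + ["C", "D"]
train_idx, valid_idx, test_idx = scaffold_split(keys)
```

Because whole groups are assigned at once, the resulting subset sizes only approximate the requested fractions.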

The raw data csv file contains columns below:

  • “mol_id” - PubChem CID of the compound

  • “smiles” - SMILES representation of the molecular structure

  • “MUV-XXX” - Measured results (Active/Inactive) for bioassays

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

References

NCI Datasets

load_nci(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]

Load NCI dataset.

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

PCBA Datasets

load_pcba(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['balancing'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]

Load PCBA dataset

PubChem BioAssay (PCBA) is a database consisting of biological activities of small molecules generated by high-throughput screening. We use a subset of PCBA, containing 128 bioassays measured over 400 thousand compounds, used by previous work to benchmark machine learning methods.

Random splitting is recommended for this dataset.

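The raw data csv file contains columns below:

  • “mol_id” - PubChem CID of the compound

  • “smiles” - SMILES representation of the molecular structure

  • “PCBA-XXX” - Measured results (Active/Inactive) for bioassays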

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

References

PDBBIND Datasets

load_pdbbind(featurizer: ComplexFeaturizer, splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, pocket: bool = True, set_name: str = 'core', **kwargs) Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]

Load PDBBind dataset.

The PDBBind dataset includes experimental binding affinity data and structures for 4852 protein-ligand complexes from the “refined set” and 12800 complexes from the “general set” in PDBBind v2019 and 193 complexes from the “core set” in PDBBind v2013. The refined set removes data with obvious problems in 3D structure, binding data, or other aspects and should therefore be a better starting point for docking/scoring studies. Details on the criteria used to construct the refined set can be found in [4]. The general set does not include the refined set. The core set is a subset of the refined set that is not updated annually.

Random splitting is recommended for this dataset.

The raw dataset contains the columns below:

  • “ligand” - SDF of the molecular structure

  • “protein” - PDB of the protein structure

  • “-logKd/Ki” - Experimental binding affinity of the complex, used as label

Parameters:
  • featurizer (ComplexFeaturizer or str) – the complex featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

  • pocket (bool (default True)) – If true, use only the binding pocket for featurization.

  • set_name (str (default 'core')) – Name of dataset to download. ‘refined’, ‘general’, and ‘core’ are supported.

Returns:

tasks, datasets, transformers

tasks: list

Column names corresponding to machine learning target variables.

datasets: tuple

train, validation, test splits of data as deepchem.data.datasets.Dataset instances.

transformers: list

deepchem.trans.transformers.Transformer instances applied to dataset.

Return type:

tuple

References

PPB Datasets

load_ppb(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]

Load PPB datasets.

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

QM7 Datasets

load_qm7(featurizer: ~deepchem.feat.base_classes.Featurizer | str = CoulombMatrix[max_atoms=23, remove_hydrogens=False, randomize=False, upper_tri=False, n_samples=1, seed=None], splitter: ~deepchem.splits.splitters.Splitter | str | None = 'random', transformers: ~typing.List[~deepchem.molnet.load_function.molnet_loader.TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) Tuple[List[str], Tuple[Dataset, ...], List[Transformer]][source]

Load QM7 dataset

QM7 is a subset of GDB-13 (a database of nearly 1 billion stable and synthetically accessible organic molecules) containing molecules with up to 7 heavy atoms (C, N, O, and S). The 3D Cartesian coordinates of the most stable conformations and their atomization energies were determined using ab-initio density functional theory (PBE0/tier2 basis set). This dataset also provides Coulomb matrices as calculated in [Rupp et al. PRL, 2012].

Stratified splitting is recommended for this dataset.

The data file (.mat format, we recommend using scipy.io.loadmat for python users to load this original data) contains five arrays:

  • “X” - (7165 x 23 x 23), Coulomb matrices

  • “T” - (7165), atomization energies (unit: kcal/mol)

  • “P” - (5 x 1433), cross-validation splits as used in [Montavon et al. NIPS, 2012]

  • “Z” - (7165 x 23), atomic charges

  • “R” - (7165 x 23 x 3), cartesian coordinate (unit: Bohr) of each atom in the molecules
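The “X” array can in principle be rebuilt from “Z” and “R”. As a sketch of the Coulomb matrix definition from Rupp et al., the diagonal entries are 0.5 * Z_i^2.4 and the off-diagonal entries are Z_i * Z_j / |R_i - R_j|, with distances in Bohr. The function below is an unpadded illustration in plain Python with a hypothetical two-atom input; the matrices stored in the dataset are padded with zeros to 23 x 23.

```python
import math


def coulomb_matrix(charges, coords):
    """Build the (unpadded) Coulomb matrix for one molecule."""
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal: fitted estimate of the free-atom energy.
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # Off-diagonal: Coulomb repulsion between nuclei i and j.
                distance = math.dist(coords[i], coords[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


# Hypothetical example: two hydrogen nuclei placed 2.0 Bohr apart.
cm = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 2.0)])
```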

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

Note

DeepChem 2.4.0 has turned on sanitization for this dataset by default. For the QM7 dataset, this means that calling this function will return 6838 compounds instead of 7160 in the source dataset file. This appears to be due to valence specification mismatches in the dataset that weren’t caught in earlier more lax versions of RDKit. Note that this may subtly affect benchmarking results on this dataset.

References

QM8 Datasets

load_qm8(featurizer: Featurizer | str = CoulombMatrix[max_atoms=26, remove_hydrogens=False, randomize=False, upper_tri=False, n_samples=1, seed=None], splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load QM8 dataset

QM8 is the dataset used in a study on modeling quantum mechanical calculations of electronic spectra and excited state energy of small molecules. Multiple methods, including time-dependent density functional theories (TDDFT) and second-order approximate coupled-cluster (CC2), are applied to a collection of molecules that include up to eight heavy atoms (also a subset of the GDB-17 database). In our collection, there are four excited state properties calculated by four different methods on 22 thousand samples:

S0 -> S1 transition energy E1 and the corresponding oscillator strength f1

S0 -> S2 transition energy E2 and the corresponding oscillator strength f2

E1, E2, f1, and f2 are in atomic units. f1 and f2 are in length representation.

Random splitting is recommended for this dataset.

The source data contain:

  • qm8.sdf: molecular structures

  • qm8.sdf.csv: tables for molecular properties

    • Column 1: Molecule ID (gdb9 index) mapping to the .sdf file

    • Columns 2-5: RI-CC2/def2TZVP

    • Columns 6-9: LR-TDPBE0/def2SVP

    • Columns 10-13: LR-TDPBE0/def2TZVP

    • Columns 14-17: LR-TDCAM-B3LYP/def2TZVP
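The column layout above can be captured in a small lookup table. This is only a sketch: the constant and function names are hypothetical and are not part of DeepChem, and the column indices are 1-based to match the list.

```python
# 1-based column ranges of qm8.sdf.csv, as listed in the documentation.
METHOD_COLUMNS = {
    "RI-CC2/def2TZVP": range(2, 6),
    "LR-TDPBE0/def2SVP": range(6, 10),
    "LR-TDPBE0/def2TZVP": range(10, 14),
    "LR-TDCAM-B3LYP/def2TZVP": range(14, 18),
}


def method_for_column(column):
    """Return the method name for a 1-based column index.

    Returns None when the column holds no computed property,
    for example column 1 (the molecule ID).
    """
    for method, columns in METHOD_COLUMNS.items():
        if column in columns:
            return method
    return None
```

Each method spans four columns, which matches the four excited state properties (E1, E2, f1, f2) described earlier.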

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

Note

DeepChem 2.4.0 has turned on sanitization for this dataset by default. For the QM8 dataset, this means that calling this function will return 21747 compounds instead of 21786 in the source dataset file. This appears to be due to valence specification mismatches in the dataset that weren’t caught in earlier more lax versions of RDKit. Note that this may subtly affect benchmarking results on this dataset.

QM9 Datasets

load_qm9(featurizer: Featurizer | str = CoulombMatrix[max_atoms=29, remove_hydrogens=False, randomize=False, upper_tri=False, n_samples=1, seed=None], splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load QM9 dataset

QM9 is a comprehensive dataset that provides geometric, energetic, electronic and thermodynamic properties for a subset of GDB-17 database, comprising 134 thousand stable organic molecules with up to 9 heavy atoms. All molecules are modeled using density functional theory (B3LYP/6-31G(2df,p) based DFT).

Random splitting is recommended for this dataset.

The source data contain:

  • qm9.sdf: molecular structures

  • qm9.sdf.csv: tables for molecular properties

    • “mol_id” - Molecule ID (gdb9 index) mapping to the .sdf file

    • “A” - Rotational constant (unit: GHz)

    • “B” - Rotational constant (unit: GHz)

    • “C” - Rotational constant (unit: GHz)

    • “mu” - Dipole moment (unit: D)

    • “alpha” - Isotropic polarizability (unit: Bohr^3)

    • “homo” - Highest occupied molecular orbital energy (unit: Hartree)

    • “lumo” - Lowest unoccupied molecular orbital energy (unit: Hartree)

    • “gap” - Gap between HOMO and LUMO (unit: Hartree)

    • “r2” - Electronic spatial extent (unit: Bohr^2)

    • “zpve” - Zero point vibrational energy (unit: Hartree)

    • “u0” - Internal energy at 0K (unit: Hartree)

    • “u298” - Internal energy at 298.15K (unit: Hartree)

    • “h298” - Enthalpy at 298.15K (unit: Hartree)

    • “g298” - Free energy at 298.15K (unit: Hartree)

    • “cv” - Heat capacity at 298.15K (unit: cal/(mol*K))

    • “u0_atom” - Atomization energy at 0K (unit: kcal/mol)

    • “u298_atom” - Atomization energy at 298.15K (unit: kcal/mol)

    • “h298_atom” - Atomization enthalpy at 298.15K (unit: kcal/mol)

    • “g298_atom” - Atomization free energy at 298.15K (unit: kcal/mol)

“u0_atom” ~ “g298_atom” (used in MoleculeNet) are calculated from the differences between “u0” ~ “g298” and sum of reference energies of all atoms in the molecules, as given in https://figshare.com/articles/Atomref%3A_Reference_thermochemical_energies_of_H%2C_C%2C_N%2C_O%2C_F_atoms./1057643
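The relationship described above can be sketched as follows. The function name is hypothetical, the Hartree to kcal/mol conversion is inferred from the listed units rather than stated by the source, and the reference energies in the example are placeholder numbers chosen for easy arithmetic, not the Atomref values.

```python
HARTREE_TO_KCAL_PER_MOL = 627.509  # approximate conversion factor


def atomization_energy(total_energy, atom_symbols, atom_reference):
    """Total energy minus the summed per-atom reference energies.

    total_energy:   molecular energy in Hartree (for example "u0")
    atom_symbols:   element symbol of every atom in the molecule
    atom_reference: dict mapping element symbol to reference energy in Hartree
    Returns the atomization energy in kcal/mol.
    """
    reference_sum = sum(atom_reference[symbol] for symbol in atom_symbols)
    return (total_energy - reference_sum) * HARTREE_TO_KCAL_PER_MOL


# Placeholder numbers only; they are NOT the Atomref table values.
toy_reference = {"H": -0.5, "C": -37.5}
toy_methane = atomization_energy(-40.0, ["C", "H", "H", "H", "H"], toy_reference)
```

With these placeholders the reference sum is -39.5 Hartree, so the difference is -0.5 Hartree before conversion.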

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

Note

DeepChem 2.4.0 has turned on sanitization for this dataset by default. For the QM9 dataset, this means that calling this function will return 132480 compounds instead of 133885 in the source dataset file. This appears to be due to valence specification mismatches in the dataset that weren’t caught in earlier more lax versions of RDKit. Note that this may subtly affect benchmarking results on this dataset.

SAMPL Datasets

load_sampl(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load SAMPL(FreeSolv) dataset

The Free Solvation Database, FreeSolv(SAMPL), provides experimental and calculated hydration free energy of small molecules in water. The calculated values are derived from alchemical free energy calculations using molecular dynamics simulations. The experimental values are included in the benchmark collection.

Random splitting is recommended for this dataset.

The raw data csv file contains columns below:

  • “iupac” - IUPAC name of the compound

  • “smiles” - SMILES representation of the molecular structure

  • “expt” - Measured solvation energy (unit: kcal/mol) of the compound, used as label

  • “calc” - Calculated solvation energy (unit: kcal/mol) of the compound
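Because the file carries both measured and calculated values, one simple sanity check is to compare the two columns. The sketch below assumes the column names listed above; the rows are made-up placeholders in that layout, not real FreeSolv entries, and the function name is hypothetical.

```python
import csv
import io
import math

# Made-up rows in the documented column layout; not real FreeSolv entries.
TOY_CSV = """iupac,smiles,expt,calc
compound_a,C,-1.0,-2.0
compound_b,CC,-3.0,-3.0
compound_c,CCC,-5.0,-4.0
"""


def calc_vs_expt_errors(csv_text):
    """Return (MAE, RMSE) of the "calc" column against the "expt" labels."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    diffs = [float(row["calc"]) - float(row["expt"]) for row in rows]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse


mae, rmse = calc_vs_expt_errors(TOY_CSV)
```

Both errors are in kcal/mol, the unit shared by the two columns.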

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

SIDER Datasets

load_sider(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['balancing'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load SIDER dataset

The Side Effect Resource (SIDER) is a database of marketed drugs and adverse drug reactions (ADR). The version of the SIDER dataset in DeepChem has grouped drug side effects into 27 system organ classes following MedDRA classifications measured for 1427 approved drugs.

Random splitting is recommended for this dataset.

The raw data csv file contains columns below:

  • “smiles”: SMILES representation of the molecular structure

  • “Hepatobiliary disorders” ~ “Injury, poisoning and procedural complications”: Recorded side effects for the drug. Please refer to http://sideeffects.embl.de/se/?page=98 for details on ADRs.

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

Thermosol Datasets

load_thermosol(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = [], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Loads the thermodynamic solubility datasets.

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

Tox21 Datasets

load_tox21(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['balancing'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, tasks: List[str] = ['NR-AR', 'NR-AR-LBD', 'NR-AhR', 'NR-Aromatase', 'NR-ER', 'NR-ER-LBD', 'NR-PPAR-gamma', 'SR-ARE', 'SR-ATAD5', 'SR-HSE', 'SR-MMP', 'SR-p53'], **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load Tox21 dataset

The “Toxicology in the 21st Century” (Tox21) initiative created a public database measuring toxicity of compounds, which has been used in the 2014 Tox21 Data Challenge. This dataset contains qualitative toxicity measurements for 8k compounds on 12 different targets, including nuclear receptors and stress response pathways.

Random splitting is recommended for this dataset.

The raw data csv file contains columns below:

  • “smiles” - SMILES representation of the molecular structure

  • “NR-XXX” - Nuclear receptor signaling bioassays results

  • “SR-XXX” - Stress response bioassays results

Please refer to https://tripod.nih.gov/tox21/challenge/data.jsp for details.
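The “NR-” and “SR-” prefixes make it easy to select one assay panel. The sketch below groups the default task names from the load_tox21 signature by prefix; the helper name is hypothetical and not part of DeepChem.

```python
from collections import defaultdict

# Default task names, copied from the load_tox21 signature.
TOX21_TASKS = [
    'NR-AR', 'NR-AR-LBD', 'NR-AhR', 'NR-Aromatase', 'NR-ER', 'NR-ER-LBD',
    'NR-PPAR-gamma', 'SR-ARE', 'SR-ATAD5', 'SR-HSE', 'SR-MMP', 'SR-p53',
]


def group_by_panel(tasks):
    """Group task names by the prefix before the first hyphen (NR or SR)."""
    groups = defaultdict(list)
    for task in tasks:
        prefix = task.split("-", 1)[0]
        groups[prefix].append(task)
    return dict(groups)


panels = group_by_panel(TOX21_TASKS)
stress_response_tasks = panels["SR"]
```

A list such as stress_response_tasks could then be passed as the tasks argument of load_tox21 to load only that panel.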

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

  • tasks (List[str], optional) – Specify the set of tasks to load. If no task is specified, then it loads the default set of tasks, which are NR-AR, NR-AR-LBD, NR-AhR, NR-Aromatase, NR-ER, NR-ER-LBD, NR-PPAR-gamma, SR-ARE, SR-ATAD5, SR-HSE, SR-MMP, and SR-p53.

Toxcast Datasets

load_toxcast(featurizer: Featurizer | str = 'ECFP', splitter: Splitter | str | None = 'scaffold', transformers: List[TransformerGenerator | str] = ['balancing'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load Toxcast dataset

ToxCast is an extended data collection from the same initiative as Tox21, providing toxicology data for a large library of compounds based on in vitro high-throughput screening. The processed collection includes qualitative results of over 600 experiments on 8k compounds.

Random splitting is recommended for this dataset.

The raw data csv file contains columns below:

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

USPTO Datasets

load_uspto(featurizer: Featurizer | str = 'RxnFeaturizer', splitter: Splitter | str | None = None, transformers: List[TransformerGenerator | str] = [], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, subset: str = 'MIT', sep_reagent: bool = True, **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load USPTO Datasets.

The USPTO dataset consists of over 1.8 million organic chemical reactions extracted from US patents and patent applications. The dataset contains the reactions in the form of reaction SMILES, which have the general format: reactant>reagent>product.

MolNet provides the ability to load subsets of the USPTO dataset, namely MIT, STEREO and 50K. The MIT dataset contains around 479K reactions, curated by Jin et al. The STEREO dataset contains around 1 million reactions; it has no duplicates, and its reactions include stereochemical information. The 50K dataset contains 50,000 reactions and is the benchmark for retrosynthesis predictions. The reactions are additionally classified into 10 reaction classes. The canonicalized version of the dataset used by the loader is the same as that used by Somnath et al.

The loader uses the SpecifiedSplitter to apply the same splits as specified by Schwaller et al. and Dai et al. Custom splitters can also be used. A toggle in the loader (skip_transform) skips the source/target transformation needed for seq2seq tasks. An additional toggle (sep_reagent) loads the dataset with the reagents and reactants either separated or mixed. Mixing alters the entries in source by replacing the ‘>’ with ‘.’, effectively loading them as a unified SMILES string.
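To make the reaction SMILES format and the separated versus mixed reagent modes concrete, here is a minimal sketch in plain Python. It is not the loader's implementation: the function names are hypothetical, the reaction string is a hand-written example rather than a dataset entry, and skipping an empty reagent field is an assumption.

```python
def split_reaction(reaction_smiles):
    """Split 'reactant>reagent>product' into its three parts."""
    parts = reaction_smiles.split(">")
    if len(parts) != 3:
        raise ValueError(f"expected two '>' separators, got: {reaction_smiles!r}")
    reactants, reagents, products = parts
    return reactants, reagents, products


def build_source(reaction_smiles, sep_reagent=True):
    """Return the source string (everything before the product)."""
    reactants, reagents, _ = split_reaction(reaction_smiles)
    if sep_reagent:
        # Separated mode keeps the '>' between reactants and reagents.
        return f"{reactants}>{reagents}"
    # Mixed mode joins with '.', skipping an empty reagent field.
    return ".".join(part for part in (reactants, reagents) if part)


# Hand-written example: acetic acid + ethanol, sulfuric acid as reagent.
reaction = "CC(=O)O.OCC>OS(=O)(=O)O>CC(=O)OCC"
separated = build_source(reaction, sep_reagent=True)
mixed = build_source(reaction, sep_reagent=False)
```

In mixed mode the model no longer sees which species are reagents, which is the difference the sep_reagent toggle controls.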

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

  • subset (str (default 'MIT')) – Subset of dataset to download. ‘FULL’, ‘MIT’, ‘STEREO’, and ‘50K’ are supported.

  • sep_reagent (bool (default True)) – Toggle to load dataset with reactants and reagents either separated or mixed.

  • skip_transform (bool (default True)) – Toggle to skip the source/target transformation.

Returns:
  tasks, datasets, transformers

  • tasks (list) – Column names corresponding to machine learning target variables.

  • datasets (tuple) – train, validation, test splits of data as deepchem.data.datasets.Dataset instances.

  • transformers (list) – deepchem.trans.transformers.Transformer instances applied to dataset.

Return type:
  tuple

References

UV Datasets

load_uv(shard_size=2000, featurizer=None, split=None, reload=True)

Load UV dataset; does not do train/test split

The UV dataset is an in-house dataset from Merck that was first introduced in the following paper: Ramsundar, Bharath, et al. “Is multitask deep learning practical for pharma?.” Journal of chemical information and modeling 57.8 (2017): 2068-2076.

The UV dataset tests 10,000 of Merck’s internal compounds on 190 absorption wavelengths between 210 and 400 nm. Unlike most of the other datasets featured in MoleculeNet, the UV collection does not have structures for the compounds tested since they were proprietary Merck compounds. However, the collection does feature pre-computed descriptors for these compounds.

Note that the original train/valid/test split from the source data was preserved here, so this function doesn’t allow for alternate modes of splitting. Similarly, since the source data came pre-featurized, it is not possible to apply alternative featurizations.

Parameters:
  • shard_size (int, optional) – Size of the DiskDataset shards to write on disk

  • featurizer (optional) – Ignored since featurization pre-computed

  • split (optional) – Ignored since split pre-computed

  • reload (bool, optional) – Whether to automatically re-load from disk

ZINC15 Datasets

load_zinc15(featurizer: Featurizer | str = 'OneHot', splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = ['normalization'], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, dataset_size: str = '250K', dataset_dimension: str = '2D', tasks: List[str] = ['mwt', 'logp', 'reactive'], **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load zinc15.

ZINC15 is a dataset of over 230 million purchasable compounds for virtual screening of small molecules to identify structures that are likely to bind to drug targets. ZINC15 data is currently available in 2D (SMILES string) format.

MolNet provides subsets of 250K, 1M, and 10M “lead-like” compounds from ZINC15. The full dataset of 270M “goldilocks” compounds is also available. Compounds in ZINC15 are labeled by their molecular weight and LogP (solubility) values. Each compound also has information about how readily available (purchasable) it is and its reactivity. Lead-like compounds have molecular weight between 300 and 350 Daltons and LogP between -1 and 3.5. Goldilocks compounds are lead-like compounds with LogP values further restricted to between 2 and 3.
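The lead-like and goldilocks criteria above amount to two range checks. The sketch below is illustrative only: the function name is hypothetical, and treating the range boundaries as inclusive is an assumption, since the text does not say.

```python
def classify_compound(mol_weight, logp):
    """Return 'goldilocks', 'lead-like', or 'other' for one compound.

    mol_weight: molecular weight in Daltons
    logp:       LogP value
    """
    # Lead-like: 300 <= MW <= 350 and -1 <= LogP <= 3.5 (bounds assumed inclusive).
    lead_like = 300 <= mol_weight <= 350 and -1 <= logp <= 3.5
    if not lead_like:
        return "other"
    # Goldilocks: lead-like with LogP further restricted to 2..3.
    if 2 <= logp <= 3:
        return "goldilocks"
    return "lead-like"
```

Every goldilocks compound is also lead-like, so the function reports the more specific label when both apply.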

If reload = True and data_dir (save_dir) is specified, the loader will attempt to load the raw dataset (featurized dataset) from disk. Otherwise, the dataset will be downloaded from the DeepChem AWS bucket.

For more information on ZINC15, please see [1] and https://zinc15.docking.org/.

Parameters:
  • featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.

  • splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str) – a directory to save the dataset in

  • dataset_size (str (default '250K')) – Size of dataset to download. ‘250K’, ‘1M’, ‘10M’, and ‘270M’ are supported.

  • dataset_dimension (str (default '2D')) – Format of data to download. 2D SMILES strings or 3D SDF files.

  • tasks (List[str], optional (default ['mwt', 'logp', 'reactive'])) – Specify the set of tasks to load. If no task is specified, then it loads the default set of tasks, which are mwt, logp, and reactive.

Returns:
  tasks, datasets, transformers

  • tasks (list) – Column names corresponding to machine learning target variables.

  • datasets (tuple) – train, validation, test splits of data as deepchem.data.datasets.Dataset instances.

  • transformers (list) – deepchem.trans.transformers.Transformer instances applied to dataset.

Return type:
  tuple

Notes

The total ZINC dataset with SMILES strings contains hundreds of millions of compounds and is over 100GB! ZINC250K is recommended for experimentation. The full set of 270M goldilocks compounds is 23GB.

References

Platinum Adsorption Dataset

load_Platinum_Adsorption(featurizer: Featurizer | str = SineCoulombMatrix[max_atoms=100, flatten=True], splitter: Splitter | str | None = 'random', transformers: List[TransformerGenerator | str] = [], reload: bool = True, data_dir: str | None = None, save_dir: str | None = None, **kwargs) -> Tuple[List[str], Tuple[Dataset, ...], List[Transformer]]

Load Platinum Adsorption Dataset

The dataset consists of different configurations of adsorbates (i.e. N and NO) on a platinum surface, represented as a lattice, and their formation energies. There are 648 different adsorbate configurations in this dataset, represented as Pymatgen Structure objects.

  1. Pymatgen Structure object with site_properties containing the following keys:
    • “SiteTypes”, indicating whether the site is an active site “A1” or a spectator site “S1”.

    • “oss”, the different occupational sites. For spectator sites, set this to -1.

Parameters:
  • featurizer (Featurizer (default LCNNFeaturizer)) – the featurizer to use for processing the data. The LCNNFeaturizer is recommended.

  • splitter (Splitter (default RandomSplitter)) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.

  • transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data, appropriate to the featurizer. The LCNNFeaturizer does not require any transformation.

  • reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.

  • data_dir (str) – a directory to save the raw data in

  • save_dir (str, optional (default None)) – a directory to save the dataset in

Examples

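>>> import deepchem as dc
>>> # data_path (a directory) and feat_args (featurizer keyword arguments)
>>> # are placeholders that must be defined by the caller beforehand.
>>> tasks, datasets, transformers = dc.molnet.load_Platinum_Adsorption(
...     reload=True,
...     data_dir=data_path,
...     save_dir=data_path,
...     featurizer_kwargs=feat_args)
>>> train_dataset, val_dataset, test_dataset = datasets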