MoleculeNet

The DeepChem library is packaged alongside the MoleculeNet suite of datasets. One of the most important parts of machine learning applications is finding a suitable dataset. The MoleculeNet suite has curated a whole range of datasets and loaded them into DeepChem dc.data.Dataset objects for convenience.

Contributing a new dataset to MoleculeNet

If you are proposing a new dataset to be included in the MoleculeNet benchmarking suite, please follow the instructions below. Please review the datasets already available in MolNet before contributing.

  1. Read the Contribution guidelines.
  2. Open an issue to discuss the dataset you want to add to MolNet.
  3. Implement a function in the deepchem.molnet.load_function module following the template function deepchem.molnet.load_function.load_dataset_template. Specify which featurizers, transformers, and splitters (available from deepchem.molnet.defaults) are supported for your dataset.
  4. Add your load function to deepchem.molnet.__init__.py for easy importing.
  5. Prepare your dataset as a .tar.gz or .zip file. Accepted filetypes include CSV, JSON, and SDF.
  6. Ask a member of the technical steering committee to add your .tar.gz or .zip file to the DeepChem AWS bucket. Modify your load function to pull down the dataset from AWS.
  7. Submit a [WIP] PR (Work in progress pull request) following the PR template.
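
Step 5 above can be sketched with the Python standard library. This is only an illustration of packaging a CSV file as a .tar.gz archive; the file name my_dataset.csv, its columns, and the helper names are hypothetical and are not part of DeepChem.

```python
import tarfile
import tempfile
from pathlib import Path


def package_dataset(csv_path, archive_path):
    """Bundle a single CSV file into a gzip-compressed tar archive."""
    csv_path = Path(csv_path)
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname stores only the file name, not the full local path
        archive.add(csv_path, arcname=csv_path.name)
    return Path(archive_path)


def list_archive(archive_path):
    """Return the member names stored in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()


# Demo with a throwaway file; "my_dataset.csv" is a hypothetical name.
with tempfile.TemporaryDirectory() as tmp:
    csv_file = Path(tmp) / "my_dataset.csv"
    csv_file.write_text("smiles,label\nCCO,1\n")
    archive = package_dataset(csv_file, Path(tmp) / "my_dataset.tar.gz")
    members = list_archive(archive)

print(members)
```

Listing the archive contents before handing the file over is a cheap way to confirm that it holds exactly the files the load function expects.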

BACE Dataset

deepchem.molnet.load_bace_classification(featurizer='ECFP', split='random', reload=True, data_dir=None, save_dir=None, **kwargs)

Load BACE dataset, classification labels

BACE dataset with classification labels (“class”).

deepchem.molnet.load_bace_regression(featurizer='ECFP', split='random', reload=True, move_mean=True, data_dir=None, save_dir=None, **kwargs)

Load BACE dataset, regression labels

The BACE dataset provides quantitative IC50 and qualitative (binary label) binding results for a set of inhibitors of human beta-secretase 1 (BACE-1).

All data are experimental values reported in scientific literature over the past decade, some with detailed crystal structures available. A collection of 1522 compounds is provided, along with the regression labels of IC50.

Scaffold splitting is recommended for this dataset.

The raw data CSV file contains the columns below:

  • “mol” - SMILES representation of the molecular structure
  • “pIC50” - Negative log of the IC50 binding affinity
  • “class” - Binary labels for inhibitor
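
As a worked example of the “pIC50” column, the sketch below converts an IC50 value to pIC50. It assumes the common convention of a base-10 logarithm of the concentration in mol/L and an input given in nanomolar; the text above states neither the base nor the unit, so treat both as assumptions. The function name is hypothetical.

```python
import math


def pic50_from_ic50_nm(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50.

    Assumption: pIC50 = -log10(IC50 in mol/L). Since 1 nM = 1e-9 mol/L,
    this equals 9 - log10(IC50 in nM).
    """
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nm)


# Lower IC50 (stronger inhibition) gives a higher pIC50.
strong = pic50_from_ic50_nm(1.0)    # 1 nM
weak = pic50_from_ic50_nm(100.0)    # 100 nM
print(strong, weak)
```

Under this convention a hundredfold drop in IC50 raises pIC50 by two units, which is why the regression label is usually modeled on the log scale.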

References

[1]Subramanian, Govindan, et al. “Computational modeling of β-secretase 1 (BACE-1) inhibitors using ligand based approaches.” Journal of chemical information and modeling 56.10 (2016): 1936-1949.

BBBC Datasets

deepchem.molnet.load_bbbc001(split='index', reload=True, data_dir=None, save_dir=None, **kwargs)

Load BBBC001 dataset

This dataset contains 6 images of human HT29 colon cancer cells. The task is to learn to predict the cell counts in these images. The dataset is too small to train algorithms on, but it might serve as a good test dataset. Full details are available at https://data.broadinstitute.org/bbbc/BBBC001/.

deepchem.molnet.load_bbbc002(split='index', reload=True, data_dir=None, save_dir=None, **kwargs)

Load BBBC002 dataset

This dataset contains data corresponding to 5 samples of Drosophila Kc167 cells. There are 10 fields of view for each sample, each an image of size 512x512. Ground truth labels contain cell counts for this dataset. Full details about this dataset are available at https://data.broadinstitute.org/bbbc/BBBC002/.

BBBP Datasets

BBBP stands for Blood-Brain-Barrier Penetration.

deepchem.molnet.load_bbbp(featurizer='ECFP', split='random', reload=True, data_dir=None, save_dir=None, **kwargs)

Load BBBP dataset

The blood-brain barrier penetration (BBBP) dataset is designed for the modeling and prediction of barrier permeability. As a membrane separating circulating blood and brain extracellular fluid, the blood-brain barrier blocks most drugs, hormones and neurotransmitters. Thus penetration of the barrier is a long-standing issue in the development of drugs targeting the central nervous system.

This dataset includes binary labels for over 2000 compounds on their permeability properties.

Scaffold splitting is recommended for this dataset.

The raw data CSV file contains the columns below:

  • “name” - Name of the compound
  • “smiles” - SMILES representation of the molecular structure
  • “p_np” - Binary labels for penetration/non-penetration
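
Since scaffold splitting is recommended here, a call to the loader might look like the sketch below. It assumes that load_bbbp accepts split="scaffold" and returns the same (tasks, datasets, transformers) tuple documented for the materials loaders on this page; neither is stated in this entry. The wrapper and helper names are hypothetical, and the DeepChem import is deferred so that the snippet can be imported without DeepChem installed.

```python
def load_bbbp_scaffold(data_dir=None, save_dir=None):
    """Load BBBP with ECFP features and a scaffold split (assumed option)."""
    import deepchem as dc  # deferred import; requires DeepChem at call time

    tasks, datasets, transformers = dc.molnet.load_bbbp(
        featurizer="ECFP",
        split="scaffold",  # assumption: accepted alongside 'random'
        reload=True,
        data_dir=data_dir,
        save_dir=save_dir,
    )
    return tasks, datasets, transformers


def split_sizes(datasets):
    """Return the number of samples in each split of a (train, valid, test) tuple."""
    return tuple(len(split) for split in datasets)
```

With DeepChem installed, split_sizes(load_bbbp_scaffold()[1]) would report how many compounds ended up in the train, validation, and test splits.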

References

[1]Martins, Ines Filipa, et al. “A Bayesian approach to in silico blood-brain barrier penetration modeling.” Journal of chemical information and modeling 52.6 (2012): 1686-1697.

Cell Counting Datasets

deepchem.molnet.load_cell_counting(split=None, reload=True, data_dir=None, save_dir=None, **kwargs)

Load Cell Counting dataset.

Loads the cell counting dataset from http://www.robots.ox.ac.uk/~vgg/research/counting/index_org.html.

Chembl Datasets

deepchem.molnet.load_chembl(shard_size=2000, featurizer='ECFP', set='5thresh', split='random', reload=True, data_dir=None, save_dir=None, **kwargs)

Chembl25 Datasets

deepchem.molnet.load_chembl25(featurizer='smiles2seq', split='random', data_dir=None, save_dir=None, split_seed=None, reload=True, transformer_type='minmax', **kwargs)

Loads the ChEMBL25 dataset, featurizes it, and does a split.

Parameters:
  • featurizer (str, default smiles2seq) – Featurizer to use
  • split (str, default None) – Splitter to use
  • data_dir (str, default None) – Directory to download data to, or load dataset from. (TODO: If None, make tmp)
  • save_dir (str, default None) – Directory to save the featurized dataset to. (TODO: If None, make tmp)
  • split_seed (int, default None) – Seed to be used for splitting the dataset
  • reload (bool, default True) – Whether to reload saved dataset
  • transformer_type (str, default minmax) – Transformer to use

Clearance Datasets

deepchem.molnet.load_clearance(featurizer='ECFP', split='random', reload=True, move_mean=True, data_dir=None, save_dir=None, **kwargs)

Load clearance datasets.

Clintox Datasets

deepchem.molnet.load_clintox(featurizer='ECFP', split='index', reload=True, data_dir=None, save_dir=None, **kwargs)

Load ClinTox dataset

The ClinTox dataset compares drugs approved by the FDA and drugs that have failed clinical trials for toxicity reasons. The dataset includes two classification tasks for 1491 drug compounds with known chemical structures:

  1. clinical trial toxicity (or absence of toxicity)
  2. FDA approval status.

The list of FDA-approved drugs is compiled from the SWEETLEAD database, and the list of drugs that failed clinical trials for toxicity reasons is compiled from the Aggregate Analysis of ClinicalTrials.gov (AACT) database.

Random splitting is recommended for this dataset.

The raw data CSV file contains the columns below:

  • “smiles” - SMILES representation of the molecular structure
  • “FDA_APPROVED” - FDA approval status
  • “CT_TOX” - Clinical trial results

References

[1]Gayvert, Kaitlyn M., Neel S. Madhukar, and Olivier Elemento. “A data-driven approach to predicting successes and failures of clinical trials.” Cell chemical biology 23.10 (2016): 1294-1301.
[2]Artemov, Artem V., et al. “Integrated deep learned transcriptomic and structure-based predictor of clinical trials outcomes.” bioRxiv (2016): 095653.
[3]Novick, Paul A., et al. “SWEETLEAD: an in silico database of approved drugs, regulated chemicals, and herbal isolates for computer-aided drug discovery.” PloS one 8.11 (2013): e79568.
[4]Aggregate Analysis of ClinicalTrials.gov (AACT) Database. https://www.ctti-clinicaltrials.org/aact-database

Delaney Datasets

deepchem.molnet.load_delaney(featurizer='ECFP', split='index', reload=True, move_mean=True, data_dir=None, save_dir=None, **kwargs)

Load Delaney dataset

The Delaney (ESOL) dataset is a regression dataset containing structures and water solubility data for 1128 compounds. The dataset is widely used to validate machine learning models on estimating solubility directly from molecular structures (as encoded in SMILES strings).

Random splitting is recommended for this dataset.

The raw data CSV file contains the columns below:

  • “Compound ID” - Name of the compound
  • “smiles” - SMILES representation of the molecular structure
  • “measured log solubility in mols per litre” - Log-scale water solubility of the compound, used as label
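
The column layout above can be read with the standard csv module, as in the sketch below. The column names come from the list above; the two sample rows are made-up placeholders, not values from the real dataset, and the helper name is hypothetical.

```python
import csv
import io

LABEL_COLUMN = "measured log solubility in mols per litre"


def read_delaney_rows(handle):
    """Return (compound id, SMILES, label) triples from a Delaney-style CSV."""
    reader = csv.DictReader(handle)
    return [
        (row["Compound ID"], row["smiles"], float(row[LABEL_COLUMN]))
        for row in reader
    ]


# Placeholder rows for illustration only (not real measurements).
sample = io.StringIO(
    "Compound ID,smiles,measured log solubility in mols per litre\n"
    "example-1,CCO,0.5\n"
    "example-2,c1ccccc1,-1.5\n"
)
rows = read_delaney_rows(sample)
print(rows)
```

Because csv.DictReader looks columns up by name, any additional columns in the real file are simply ignored by this helper.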

References

[1]Delaney, John S. “ESOL: estimating aqueous solubility directly from molecular structure.” Journal of chemical information and computer sciences 44.3 (2004): 1000-1005.

Factors Datasets

deepchem.molnet.load_factors(shard_size=2000, featurizer=None, split=None, reload=True)

Loads the Factors dataset; does not perform a train/test split.

The Factors dataset is an in-house dataset from Merck that was first introduced in the following paper: Ramsundar, Bharath, et al. “Is multitask deep learning practical for pharma?.” Journal of chemical information and modeling 57.8 (2017): 2068-2076.

It contains 1500 Merck in-house compounds that were measured for IC50 of inhibition on 12 serine proteases. Unlike most of the other datasets featured in MoleculeNet, the Factors collection does not have structures for the compounds tested since they were proprietary Merck compounds. However, the collection does feature pre-computed descriptors for these compounds.

Note that the original train/valid/test split from the source data was preserved here, so this function doesn’t allow for alternate modes of splitting. Similarly, since the source data came pre-featurized, it is not possible to apply alternative featurizations.

Parameters:
  • shard_size (int, optional) – Size of the DiskDataset shards to write on disk
  • featurizer (optional) – Ignored since featurization pre-computed
  • split (optional) – Ignored since split pre-computed
  • reload (bool, optional) – Whether to automatically re-load from disk

HIV Datasets

deepchem.molnet.load_hiv(featurizer='ECFP', split='index', reload=True, data_dir=None, save_dir=None, **kwargs)

Load HIV dataset

The HIV dataset was introduced by the Drug Therapeutics Program (DTP) AIDS Antiviral Screen, which tested the ability to inhibit HIV replication for over 40,000 compounds. Screening results were evaluated and placed into three categories: confirmed inactive (CI), confirmed active (CA) and confirmed moderately active (CM). We further combine the latter two labels, making it a classification task between inactive (CI) and active (CA and CM).

Scaffold splitting is recommended for this dataset.

The raw data CSV file contains the columns below:

  • “smiles”: SMILES representation of the molecular structure
  • “activity”: Three-class labels for screening results: CI/CM/CA
  • “HIV_active”: Binary labels for screening results: 1 (CA/CM) and 0 (CI)
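
The relationship between the “activity” and “HIV_active” columns can be written as a small lookup. This is a sketch of the mapping described above (CI to 0, CA and CM to 1); the function name is hypothetical.

```python
# Three-class screening result -> binary label, as described above.
ACTIVITY_TO_BINARY = {"CI": 0, "CM": 1, "CA": 1}


def hiv_active(activity):
    """Map a CI/CM/CA screening label to the binary HIV_active label."""
    return ACTIVITY_TO_BINARY[activity.strip().upper()]


labels = [hiv_active(a) for a in ["CI", "CM", "CA"]]
print(labels)
```

An unknown label raises KeyError, which surfaces malformed rows instead of silently assigning them a class.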

References

[1]AIDS Antiviral Screen Data. https://wiki.nci.nih.gov/display/NCIDTPdata/AIDS+Antiviral+Screen+Data

HOPV Datasets

HOPV stands for the Harvard Organic Photovoltaic Dataset.

deepchem.molnet.load_hopv(featurizer='ECFP', split='index', reload=True, data_dir=None, save_dir=None, **kwargs)

Load HOPV datasets. Does not do train/test split

The HOPV datasets consist of the “Harvard Organic Photovoltaic Dataset”. This dataset includes 350 small molecules and polymers that were utilized as p-type materials in OPVs. Experimental properties include: HOMO [a.u.], LUMO [a.u.], Electrochemical gap [a.u.], Optical gap [a.u.], Power conversion efficiency [%], Open circuit potential [V], Short circuit current density [mA/cm^2], and fill factor [%]. Theoretical calculations in the original dataset have been removed (for now).

Lopez, Steven A., et al. “The Harvard organic photovoltaic dataset.” Scientific data 3.1 (2016): 1-7.

HPPB Datasets

deepchem.molnet.load_hppb(featurizer='ECFP', data_dir=None, save_dir=None, split=None, split_seed=None, reload=True, **kwargs)

Loads the thermodynamic solubility datasets.

KAGGLE Datasets

deepchem.molnet.load_kaggle(shard_size=2000, featurizer=None, split=None, reload=True)

Loads kaggle datasets. Generates if not stored already.

The Kaggle dataset is an in-house dataset from Merck that was first introduced in the following paper:

Ma, Junshui, et al. “Deep neural nets as a method for quantitative structure–activity relationships.” Journal of chemical information and modeling 55.2 (2015): 263-274.

It contains 100,000 unique Merck in-house compounds that were measured on 15 enzyme inhibition and ADME/TOX datasets. Unlike most of the other datasets featured in MoleculeNet, the Kaggle collection does not have structures for the compounds tested since they were proprietary Merck compounds. However, the collection does feature pre-computed descriptors for these compounds.

Note that the original train/valid/test split from the source data was preserved here, so this function doesn’t allow for alternate modes of splitting. Similarly, since the source data came pre-featurized, it is not possible to apply alternative featurizations.

Parameters:
  • shard_size (int, optional) – Size of the DiskDataset shards to write on disk
  • featurizer (optional) – Ignored since featurization pre-computed
  • split (optional) – Ignored since split pre-computed
  • reload (bool, optional) – Whether to automatically re-load from disk

Kinase Datasets

deepchem.molnet.load_kinase(shard_size=2000, featurizer=None, split=None, reload=True)

Loads Kinase datasets, does not do train/test split

The Kinase dataset is an in-house dataset from Merck that was first introduced in the following paper: Ramsundar, Bharath, et al. “Is multitask deep learning practical for pharma?.” Journal of chemical information and modeling 57.8 (2017): 2068-2076.

It contains 2500 Merck in-house compounds that were measured for IC50 of inhibition on 99 protein kinases. Unlike most of the other datasets featured in MoleculeNet, the Kinase collection does not have structures for the compounds tested since they were proprietary Merck compounds. However, the collection does feature pre-computed descriptors for these compounds.

Note that the original train/valid/test split from the source data was preserved here, so this function doesn’t allow for alternate modes of splitting. Similarly, since the source data came pre-featurized, it is not possible to apply alternative featurizations.

Parameters:
  • shard_size (int, optional) – Size of the DiskDataset shards to write on disk
  • featurizer (optional) – Ignored since featurization pre-computed
  • split (optional) – Ignored since split pre-computed
  • reload (bool, optional) – Whether to automatically re-load from disk

Lipo Datasets

deepchem.molnet.load_lipo(featurizer='ECFP', split='index', reload=True, move_mean=True, data_dir=None, save_dir=None, **kwargs)

Load Lipophilicity dataset

Lipophilicity is an important feature of drug molecules that affects both membrane permeability and solubility. The lipophilicity dataset, curated from ChEMBL database, provides experimental results of octanol/water distribution coefficient (logD at pH 7.4) of 4200 compounds.

Random splitting is recommended for this dataset.

The raw data CSV file contains the columns below:

  • “smiles” - SMILES representation of the molecular structure
  • “exp” - Measured octanol/water distribution coefficient (logD) of the compound, used as label

References

[1]Hersey, A. ChEMBL Deposited Data Set - AZ dataset; 2015. https://doi.org/10.6019/chembl3301361

Materials Datasets

Materials datasets include inorganic crystal structures, chemical compositions, and target properties like formation energies and band gaps. Machine learning problems in materials science commonly include predicting the value of a continuous (regression) or categorical (classification) property of a material based on its chemical composition or crystal structure. “Inverse design” is also of great interest, in which ML methods generate crystal structures that have a desired property. Other areas where ML is applicable in materials include discovering new or modified phenomenological models that describe material behavior.

deepchem.molnet.load_bandgap(featurizer=<class 'deepchem.feat.material_featurizers.element_property_fingerprint.ElementPropertyFingerprint'>, transformers: List[T] = [<class 'deepchem.trans.transformers.NormalizationTransformer'>], splitter=<class 'deepchem.splits.splitters.RandomSplitter'>, reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, featurizer_kwargs: Dict[str, Any] = {}, splitter_kwargs: Dict[str, Any] = {'frac_test': 0.1, 'frac_train': 0.8, 'frac_valid': 0.1}, transformer_kwargs: Dict[str, Dict[str, Any]] = {'NormalizationTransformer': {'transform_X': True}}, **kwargs) → Tuple[List[T], Optional[Tuple], List[T]]

Load band gap dataset.

Contains 4604 experimentally measured band gaps for inorganic crystal structure compositions. In benchmark studies, random forest models achieved a mean absolute error of 0.45 eV during five-fold nested cross validation on this dataset.

For more details on the dataset see [1]. For more details on previous benchmarks for this dataset, see [2].
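
The benchmark figure quoted above is an error averaged over samples. Reading it as a mean absolute error (MAE), the metric can be sketched as follows. The band gap values are toy numbers in eV chosen for easy arithmetic, not data from this dataset, and the function is a plain-Python illustration rather than a DeepChem API.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted| over all samples."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy band gaps in eV: absolute errors are 0.5, 0.0 and 1.0.
mae = mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
print(mae)
```

An MAE of 0.45 eV therefore means that predictions are off by 0.45 eV on average, regardless of the sign of the error.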

Parameters:
  • featurizer (MaterialCompositionFeaturizer (default ElementPropertyFingerprint)) – A featurizer that inherits from deepchem.feat.Featurizer.
  • transformers (List[Transformer]) – A transformer that inherits from deepchem.trans.Transformer.
  • splitter (Splitter (default RandomSplitter)) – A splitter that inherits from deepchem.splits.splitters.Splitter.
  • reload (bool (default True)) – Try to reload dataset from disk if already downloaded. Save to disk after featurizing.
  • data_dir (str, optional (default None)) – Path to datasets.
  • save_dir (str, optional (default None)) – Path to featurized datasets.
  • featurizer_kwargs (Dict[str, Any]) – Specify parameters to featurizer, e.g. {“size”: 1024}
  • splitter_kwargs (Dict[str, Any]) – Specify parameters to splitter, e.g. {“seed”: 42}
  • transformer_kwargs (dict) – Maps transformer names to constructor arguments, e.g. {“BalancingTransformer”: {“transform_X”:True, “transform_y”:False}}
  • **kwargs (additional optional arguments.) –
Returns:
  • tasks (list) – Column names corresponding to machine learning target variables.
  • datasets (tuple) – train, validation, test splits of data as deepchem.data.datasets.Dataset instances.
  • transformers (list) – deepchem.trans.transformers.Transformer instances applied to dataset.

Return type: tuple

References

[1]Zhuo, Y. et al. “Predicting the Band Gaps of Inorganic Solids by Machine Learning.” J. Phys. Chem. Lett. (2018) DOI: 10.1021/acs.jpclett.8b00124.
[2]Dunn, A. et al. “Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference Algorithm.” https://arxiv.org/abs/2005.00707 (2020)

Examples

>> import deepchem as dc
>> tasks, datasets, transformers = dc.molnet.load_bandgap(reload=False)
>> train_dataset, val_dataset, test_dataset = datasets
>> n_tasks = len(tasks)
>> n_features = train_dataset.get_data_shape()[0]
>> model = dc.models.MultitaskRegressor(n_tasks, n_features)

deepchem.molnet.load_perovskite(featurizer=<class 'deepchem.feat.material_featurizers.sine_coulomb_matrix.SineCoulombMatrix'>, transformers: List[T] = [<class 'deepchem.trans.transformers.NormalizationTransformer'>], splitter=<class 'deepchem.splits.splitters.RandomSplitter'>, reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, featurizer_kwargs: Dict[str, Any] = {}, splitter_kwargs: Dict[str, Any] = {'frac_test': 0.1, 'frac_train': 0.8, 'frac_valid': 0.1}, transformer_kwargs: Dict[str, Dict[str, Any]] = {'NormalizationTransformer': {'transform_X': True}}, **kwargs) → Tuple[List[T], Optional[Tuple], List[T]]

Load perovskite dataset.

Contains 18928 perovskite structures and their formation energies. In benchmark studies, random forest models and crystal graph neural networks achieved mean absolute errors of 0.23 and 0.05 eV/atom, respectively, during five-fold nested cross validation on this dataset.

For more details on the dataset see [1]. For more details on previous benchmarks for this dataset, see [2].

Parameters:
  • featurizer (MaterialStructureFeaturizer (default SineCoulombMatrix)) – A featurizer that inherits from deepchem.feat.Featurizer.
  • transformers (List[Transformer]) – A transformer that inherits from deepchem.trans.Transformer.
  • splitter (Splitter (default RandomSplitter)) – A splitter that inherits from deepchem.splits.splitters.Splitter.
  • reload (bool (default True)) – Try to reload dataset from disk if already downloaded. Save to disk after featurizing.
  • data_dir (str, optional (default None)) – Path to datasets.
  • save_dir (str, optional (default None)) – Path to featurized datasets.
  • featurizer_kwargs (Dict[str, Any]) – Specify parameters to featurizer, e.g. {“size”: 1024}
  • splitter_kwargs (Dict[str, Any]) – Specify parameters to splitter, e.g. {“seed”: 42}
  • transformer_kwargs (dict) – Maps transformer names to constructor arguments, e.g. {“BalancingTransformer”: {“transform_X”:True, “transform_y”:False}}
  • **kwargs (additional optional arguments.) –
Returns:
  • tasks (list) – Column names corresponding to machine learning target variables.
  • datasets (tuple) – train, validation, test splits of data as deepchem.data.datasets.Dataset instances.
  • transformers (list) – deepchem.trans.transformers.Transformer instances applied to dataset.

Return type: tuple

References

[1]Castelli, I. et al. “New cubic perovskites for one- and two-photon water splitting using the computational materials repository.” Energy Environ. Sci., (2012), 5, 9034-9043 DOI: 10.1039/C2EE22341D.
[2]Dunn, A. et al. “Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference Algorithm.” https://arxiv.org/abs/2005.00707 (2020)

Examples

>> import deepchem as dc
>> tasks, datasets, transformers = dc.molnet.load_perovskite(reload=False)
>> train_dataset, val_dataset, test_dataset = datasets
>> n_tasks = len(tasks)
>> n_features = train_dataset.get_data_shape()[0]
>> model = dc.models.MultitaskRegressor(n_tasks, n_features)

deepchem.molnet.load_mp_formation_energy(featurizer=<class 'deepchem.feat.material_featurizers.sine_coulomb_matrix.SineCoulombMatrix'>, transformers: List[T] = [<class 'deepchem.trans.transformers.NormalizationTransformer'>], splitter=<class 'deepchem.splits.splitters.RandomSplitter'>, reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, featurizer_kwargs: Dict[str, Any] = {}, splitter_kwargs: Dict[str, Any] = {'frac_test': 0.1, 'frac_train': 0.8, 'frac_valid': 0.1}, transformer_kwargs: Dict[str, Dict[str, Any]] = {'NormalizationTransformer': {'transform_X': True}}, **kwargs) → Tuple[List[T], Optional[Tuple], List[T]]

Load mp formation energy dataset.

Contains 132752 calculated formation energies and inorganic crystal structures from the Materials Project database. In benchmark studies, random forest models achieved a mean absolute error of 0.116 eV/atom during five-fold nested cross validation on this dataset.

For more details on the dataset see [1]. For more details on previous benchmarks for this dataset, see [2].

Parameters:
  • featurizer (MaterialStructureFeaturizer (default SineCoulombMatrix)) – A featurizer that inherits from deepchem.feat.Featurizer.
  • transformers (List[Transformer]) – A transformer that inherits from deepchem.trans.Transformer.
  • splitter (Splitter (default RandomSplitter)) – A splitter that inherits from deepchem.splits.splitters.Splitter.
  • reload (bool (default True)) – Try to reload dataset from disk if already downloaded. Save to disk after featurizing.
  • data_dir (str, optional (default None)) – Path to datasets.
  • save_dir (str, optional (default None)) – Path to featurized datasets.
  • featurizer_kwargs (Dict[str, Any]) – Specify parameters to featurizer, e.g. {“size”: 1024}
  • splitter_kwargs (Dict[str, Any]) – Specify parameters to splitter, e.g. {“seed”: 42}
  • transformer_kwargs (dict) – Maps transformer names to constructor arguments, e.g. {“BalancingTransformer”: {“transform_X”:True, “transform_y”:False}}
  • **kwargs (additional optional arguments.) –
Returns:
  • tasks (list) – Column names corresponding to machine learning target variables.
  • datasets (tuple) – train, validation, test splits of data as deepchem.data.datasets.Dataset instances.
  • transformers (list) – deepchem.trans.transformers.Transformer instances applied to dataset.

Return type: tuple

References

[1]A. Jain*, S.P. Ong*, et al. (*=equal contributions) The Materials Project: A materials genome approach to accelerating materials innovation APL Materials, 2013, 1(1), 011002. doi:10.1063/1.4812323 (2013).
[2]Dunn, A. et al. “Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference Algorithm.” https://arxiv.org/abs/2005.00707 (2020)

Examples

>> import deepchem as dc
>> tasks, datasets, transformers = dc.molnet.load_mp_formation_energy(reload=False)
>> train_dataset, val_dataset, test_dataset = datasets
>> n_tasks = len(tasks)
>> n_features = train_dataset.get_data_shape()[0]
>> model = dc.models.MultitaskRegressor(n_tasks, n_features)

deepchem.molnet.load_mp_metallicity(featurizer=<class 'deepchem.feat.material_featurizers.sine_coulomb_matrix.SineCoulombMatrix'>, transformers: List[T] = [<class 'deepchem.trans.transformers.NormalizationTransformer'>], splitter=<class 'deepchem.splits.splitters.RandomSplitter'>, reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, featurizer_kwargs: Dict[str, Any] = {}, splitter_kwargs: Dict[str, Any] = {'frac_test': 0.1, 'frac_train': 0.8, 'frac_valid': 0.1}, transformer_kwargs: Dict[str, Dict[str, Any]] = {'NormalizationTransformer': {'transform_X': True}}, **kwargs) → Tuple[List[T], Optional[Tuple], List[T]]

Load mp metallicity dataset.

Contains 106113 inorganic crystal structures from the Materials Project database labeled as metals or nonmetals. In benchmark studies, random forest models achieved a mean ROC-AUC of 0.9 during five-fold nested cross validation on this dataset.

For more details on the dataset see [1]. For more details on previous benchmarks for this dataset, see [2].

Parameters:
  • featurizer (MaterialStructureFeaturizer (default SineCoulombMatrix)) – A featurizer that inherits from deepchem.feat.Featurizer.
  • transformers (List[Transformer]) – A transformer that inherits from deepchem.trans.Transformer.
  • splitter (Splitter (default RandomSplitter)) – A splitter that inherits from deepchem.splits.splitters.Splitter.
  • reload (bool (default True)) – Try to reload dataset from disk if already downloaded. Save to disk after featurizing.
  • data_dir (str, optional (default None)) – Path to datasets.
  • save_dir (str, optional (default None)) – Path to featurized datasets.
  • featurizer_kwargs (Dict[str, Any]) – Specify parameters to featurizer, e.g. {“size”: 1024}
  • splitter_kwargs (Dict[str, Any]) – Specify parameters to splitter, e.g. {“seed”: 42}
  • transformer_kwargs (dict) – Maps transformer names to constructor arguments, e.g. {“BalancingTransformer”: {“transform_X”:True, “transform_y”:False}}
  • **kwargs (additional optional arguments.) –
Returns:
  • tasks (list) – Column names corresponding to machine learning target variables.
  • datasets (tuple) – train, validation, test splits of data as deepchem.data.datasets.Dataset instances.
  • transformers (list) – deepchem.trans.transformers.Transformer instances applied to dataset.

Return type: tuple

References

[1]A. Jain*, S.P. Ong*, et al. (*=equal contributions) The Materials Project: A materials genome approach to accelerating materials innovation APL Materials, 2013, 1(1), 011002. doi:10.1063/1.4812323 (2013).
[2]Dunn, A. et al. “Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference Algorithm.” https://arxiv.org/abs/2005.00707 (2020)

Examples

>> import deepchem as dc
>> tasks, datasets, transformers = dc.molnet.load_mp_metallicity(reload=False)
>> train_dataset, val_dataset, test_dataset = datasets
>> n_tasks = len(tasks)
>> n_features = train_dataset.get_data_shape()[0]
>> model = dc.models.MultitaskRegressor(n_tasks, n_features)

MUV Datasets

deepchem.molnet.load_muv(featurizer='ECFP', split='index', reload=True, K=4, data_dir=None, save_dir=None, **kwargs)

Load MUV dataset

The Maximum Unbiased Validation (MUV) group is a benchmark dataset selected from PubChem BioAssay by applying a refined nearest neighbor analysis.

The MUV dataset contains 17 challenging tasks for around 90 thousand compounds and is specifically designed for validation of virtual screening techniques.

Random splitting is recommended for this dataset.

The raw data CSV file contains the columns below:

  • “mol_id” - PubChem CID of the compound
  • “smiles” - SMILES representation of the molecular structure
  • “MUV-XXX” - Measured results (Active/Inactive) for bioassays

References

[1]Rohrer, Sebastian G., and Knut Baumann. “Maximum unbiased validation (MUV) data sets for virtual screening based on PubChem bioactivity data.” Journal of chemical information and modeling 49.2 (2009): 169-184.

NCI Datasets

deepchem.molnet.load_nci(featurizer='ECFP', shard_size=1000, split='random', reload=True, data_dir=None, save_dir=None, **kwargs)

PCBA Datasets

deepchem.molnet.load_pcba(featurizer='ECFP', split='random', reload=True, data_dir=None, save_dir=None, **kwargs)

PDBBIND Datasets

deepchem.molnet.load_pdbbind(reload=True, data_dir=None, subset='core', load_binding_pocket=False, featurizer='grid', split='random', split_seed=None, save_dir=None, save_timestamp=False)

Load raw PDBBind dataset by featurization and split.

Parameters:
  • reload (Bool, optional) – Whether to reload the saved featurized and split dataset.
  • data_dir (Str, optional) – Specifies the directory storing the raw dataset.
  • load_binding_pocket (Bool, optional) – Load binding pocket or full protein.
  • subset (Str) – Specifies which subset of PDBBind, only “core” or “refined” for now.
  • featurizer (Str) – Either “grid” or “atomic” for grid and atomic featurizations.
  • split (Str) – Either “random” or “index”.
  • split_seed (Int, optional) – Specifies the random seed for splitter.
  • save_dir (Str, optional) – Specifies the directory in which to store the featurized and split dataset when reload is False. If reload is True, the saved dataset inside save_dir is loaded instead.
  • save_timestamp (Bool, optional) – Whether to save the featurized and split dataset with a timestamp. Set it to True when running similar or identical jobs simultaneously on multiple compute nodes.
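
The option values listed above can be checked before calling the loader, as in the sketch below. The allowed values are taken from the parameter descriptions; the helper names are hypothetical, and the DeepChem import is deferred so that the validation part runs on its own. What load_pdbbind returns is not documented in this entry, so the wrapper simply passes the result through.

```python
VALID_SUBSETS = {"core", "refined"}
VALID_FEATURIZERS = {"grid", "atomic"}
VALID_SPLITS = {"random", "index"}


def pdbbind_kwargs(subset="core", featurizer="grid", split="random",
                   split_seed=None, save_timestamp=False):
    """Validate the documented option values and return them as a dict."""
    if subset not in VALID_SUBSETS:
        raise ValueError(f"subset must be one of {sorted(VALID_SUBSETS)}")
    if featurizer not in VALID_FEATURIZERS:
        raise ValueError(f"featurizer must be one of {sorted(VALID_FEATURIZERS)}")
    if split not in VALID_SPLITS:
        raise ValueError(f"split must be one of {sorted(VALID_SPLITS)}")
    return {
        "subset": subset,
        "featurizer": featurizer,
        "split": split,
        "split_seed": split_seed,
        "save_timestamp": save_timestamp,
    }


def load_pdbbind_checked(**options):
    """Call the loader only after the options pass validation."""
    import deepchem as dc  # deferred import; requires DeepChem at call time

    return dc.molnet.load_pdbbind(**pdbbind_kwargs(**options))


options = pdbbind_kwargs(subset="refined", featurizer="atomic", split_seed=123)
print(options)
```

Validating first fails fast on a mistyped option instead of after a long download and featurization.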

PPB Datasets

deepchem.molnet.load_ppb(featurizer='ECFP', split='index', reload=True, data_dir=None, save_dir=None, **kwargs)

Load PPB datasets.

QM7 Datasets

deepchem.molnet.load_qm7(featurizer='CoulombMatrix', split='random', reload=True, move_mean=True, data_dir=None, save_dir=None, **kwargs)

Load QM7 dataset

QM7 is a subset of GDB-13 (a database of nearly 1 billion stable and synthetically accessible organic molecules) containing up to 7 heavy atoms C, N, O, and S. The 3D Cartesian coordinates of the most stable conformations and their atomization energies were determined using ab-initio density functional theory (PBE0/tier2 basis set). This dataset also provides Coulomb matrices as calculated in [Rupp et al. PRL, 2012].
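
For orientation, the Coulomb matrix of Rupp et al. is commonly defined with diagonal entries 0.5 * Z_i^2.4 and off-diagonal entries Z_i * Z_j / |R_i - R_j|, where Z are nuclear charges and R are atomic positions. The sketch below implements that definition in plain Python. It is an illustration under that assumed definition, not the code used to build the dataset, and the two-atom input is a toy example with a distance in Bohr.

```python
import math


def coulomb_matrix(charges, coords):
    """Build a Coulomb matrix from nuclear charges and 3D coordinates.

    Assumed definition (Rupp et al. 2012):
      C[i][i] = 0.5 * Z_i ** 2.4
      C[i][j] = Z_i * Z_j / |R_i - R_j|   for i != j
    """
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                distance = math.dist(coords[i], coords[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


# Toy example: two hydrogen-like atoms (Z = 1) separated by 1.4 Bohr.
cm = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)])
print(cm)
```

The matrix is symmetric by construction, and only interatomic distances enter the off-diagonal terms, so it does not change when the molecule is translated or rotated.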

Stratified splitting is recommended for this dataset.

The data file (.mat format; Python users can load the original data with scipy.io.loadmat) contains five arrays:

  • “X” - (7165 x 23 x 23), Coulomb matrices
  • “T” - (7165), atomization energies (unit: kcal/mol)
  • “P” - (5 x 1433), cross-validation splits as used in [Montavon et al. NIPS, 2012]
  • “Z” - (7165 x 23), atomic charges
  • “R” - (7165 x 23 x 3), Cartesian coordinates (unit: Bohr) of each atom in the molecules
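
As a rough illustration of how the “X” array relates to “Z” and “R”, the sketch below computes a Coulomb matrix using the definition from Rupp et al. [1]: diagonal entries 0.5 * Z_i^2.4 and off-diagonal entries Z_i * Z_j / |R_i - R_j|, with distances in Bohr. The function name and the two-atom example are illustrative assumptions. The stored matrices are padded to 23 x 23 and may order atoms differently, so this is not expected to reproduce “X” entry for entry.

```python
import math


def coulomb_matrix(charges, coords):
    """Return the Coulomb matrix as a list of lists.

    charges: nuclear charges Z_i.
    coords: (x, y, z) positions in Bohr, one tuple per atom.
    """
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal term: 0.5 * Z_i ** 2.4
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # Off-diagonal term: Coulomb repulsion between nuclei i and j
                distance = math.dist(coords[i], coords[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


# Hypothetical example: two hydrogen atoms placed 1.4 Bohr apart.
h2 = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)])
```

The matrix is symmetric by construction, which is a quick sanity check when comparing against values loaded from the .mat file.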

References

[1]Rupp, Matthias, et al. “Fast and accurate modeling of molecular atomization energies with machine learning.” Physical review letters 108.5 (2012): 058301.
[2]Montavon, Grégoire, et al. “Learning invariant representations of molecules for atomization energy prediction.” Advances in Neural Information Processing Systems. 2012.

deepchem.molnet.load_qm7_from_mat(featurizer='CoulombMatrix', split='stratified', reload=True, move_mean=True, data_dir=None, save_dir=None, **kwargs)[source]

deepchem.molnet.load_qm7b_from_mat(featurizer='CoulombMatrix', split='stratified', reload=True, move_mean=True, data_dir=None, save_dir=None, **kwargs)[source]

Load QM7B dataset

QM7b is an extension of the QM7 dataset with additional properties predicted at different levels of theory (ZINDO, SCS, PBE0, GW). In total, 14 tasks are included for 7211 molecules with up to 7 heavy atoms.

Random splitting is recommended for this dataset.

The data file (.mat format; Python users can load the original data with scipy.io.loadmat) contains two arrays:

  • “X” - (7211 x 23 x 23), Coulomb matrices
  • “T” - (7211 x 14), properties:
    1. Atomization energies E (PBE0, unit: kcal/mol)
    2. Excitation of maximal optimal absorption E_max (ZINDO, unit: eV)
    3. Absorption Intensity at maximal absorption I_max (ZINDO)
    4. Highest occupied molecular orbital HOMO (ZINDO, unit: eV)
    5. Lowest unoccupied molecular orbital LUMO (ZINDO, unit: eV)
    6. First excitation energy E_1st (ZINDO, unit: eV)
    7. Ionization potential IP (ZINDO, unit: eV)
    8. Electron affinity EA (ZINDO, unit: eV)
    9. Highest occupied molecular orbital HOMO (PBE0, unit: eV)
    10. Lowest unoccupied molecular orbital LUMO (PBE0, unit: eV)
    11. Highest occupied molecular orbital HOMO (GW, unit: eV)
    12. Lowest unoccupied molecular orbital LUMO (GW, unit: eV)
    13. Polarizabilities α (PBE0, unit: Å^3)
    14. Polarizabilities α (SCS, unit: Å^3)
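
Because the 14 columns of “T” are identified only by position, a small lookup table can make indexing less error-prone. The sketch below simply transcribes the list above into 0-based column indices; the names QM7B_PROPERTIES, property_column, and homo_lumo_gap are illustrative and not part of DeepChem.

```python
# (property, method) labels for the 14 columns of "T", in the order listed
# above. The list position is the 0-based column index.
QM7B_PROPERTIES = [
    ("E", "PBE0"), ("E_max", "ZINDO"), ("I_max", "ZINDO"), ("HOMO", "ZINDO"),
    ("LUMO", "ZINDO"), ("E_1st", "ZINDO"), ("IP", "ZINDO"), ("EA", "ZINDO"),
    ("HOMO", "PBE0"), ("LUMO", "PBE0"), ("HOMO", "GW"), ("LUMO", "GW"),
    ("alpha", "PBE0"), ("alpha", "SCS"),
]


def property_column(name, method):
    """Return the 0-based column of "T" holding the given property."""
    return QM7B_PROPERTIES.index((name, method))


def homo_lumo_gap(row, method):
    """Return LUMO - HOMO (in eV) from one molecule's row of 14 values."""
    return row[property_column("LUMO", method)] - row[property_column("HOMO", method)]
```

For example, property_column("HOMO", "GW") gives the column for item 11 in the list, and homo_lumo_gap(row, "PBE0") combines items 9 and 10.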

References

[1]Blum, Lorenz C., and Jean-Louis Reymond. “970 million druglike small molecules for virtual screening in the chemical universe database GDB-13.” Journal of the American Chemical Society 131.25 (2009): 8732-8733.
[2]Montavon, Grégoire, et al. “Machine learning of molecular electronic properties in chemical compound space.” New Journal of Physics 15.9 (2013): 095003.

QM8 Datasets

deepchem.molnet.load_qm8(featurizer='CoulombMatrix', split='random', reload=True, move_mean=True, data_dir=None, save_dir=None, **kwargs)[source]

Load QM8 dataset

QM8 is the dataset used in a study on modeling quantum mechanical calculations of electronic spectra and excited state energy of small molecules. Multiple methods, including time-dependent density functional theories (TDDFT) and second-order approximate coupled-cluster (CC2), are applied to a collection of molecules that include up to eight heavy atoms (also a subset of the GDB-17 database). In our collection, there are four excited state properties calculated by four different methods on 22 thousand samples:

  • S0 -> S1 transition energy E1 and the corresponding oscillator strength f1
  • S0 -> S2 transition energy E2 and the corresponding oscillator strength f2

E1, E2, f1, and f2 are in atomic units. f1 and f2 are in length representation.

Random splitting is recommended for this dataset.

The source data contain:

  • qm8.sdf: molecular structures
  • qm8.sdf.csv: tables for molecular properties
    • Column 1: Molecule ID (gdb9 index) mapping to the .sdf file
    • Columns 2-5: RI-CC2/def2TZVP
    • Columns 6-9: LR-TDPBE0/def2SVP
    • Columns 10-13: LR-TDPBE0/def2TZVP
    • Columns 14-17: LR-TDCAM-B3LYP/def2TZVP
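
The column layout above can be expressed as slices over a parsed CSV row, as in the following sketch. It assumes each row has already been split into a list of strings (for example with Python's csv module). The example ID and values are synthetic placeholders, and the order of the four values inside each method's block is not specified here, so the code keeps them as an unlabeled list.

```python
# 1-based column ranges from the list above, converted to 0-based slices.
QM8_METHOD_COLUMNS = {
    "RI-CC2/def2TZVP": slice(1, 5),
    "LR-TDPBE0/def2SVP": slice(5, 9),
    "LR-TDPBE0/def2TZVP": slice(9, 13),
    "LR-TDCAM-B3LYP/def2TZVP": slice(13, 17),
}


def split_row_by_method(row):
    """Split one parsed CSV row into the molecule ID and per-method values."""
    mol_id = row[0]
    values = {
        method: [float(v) for v in row[cols]]
        for method, cols in QM8_METHOD_COLUMNS.items()
    }
    return mol_id, values


# Synthetic row: a made-up ID followed by 16 placeholder numbers.
example_row = ["gdb_1"] + [str(i / 100) for i in range(1, 17)]
mol_id, by_method = split_row_by_method(example_row)
```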

References

[1]Blum, Lorenz C., and Jean-Louis Reymond. “970 million druglike small molecules for virtual screening in the chemical universe database GDB-13.” Journal of the American Chemical Society 131.25 (2009): 8732-8733.
[2]Ramakrishnan, Raghunathan, et al. “Electronic spectra from TDDFT and machine learning in chemical space.” The Journal of chemical physics 143.8 (2015): 084111.

QM9 Datasets

deepchem.molnet.load_qm9(featurizer='CoulombMatrix', split='random', reload=True, move_mean=True, data_dir=None, save_dir=None, **kwargs)[source]

Load QM9 dataset

QM9 is a comprehensive dataset that provides geometric, energetic, electronic and thermodynamic properties for a subset of GDB-17 database, comprising 134 thousand stable organic molecules with up to 9 heavy atoms. All molecules are modeled using density functional theory (B3LYP/6-31G(2df,p) based DFT).

Random splitting is recommended for this dataset.

The source data contain:

  • qm9.sdf: molecular structures
  • qm9.sdf.csv: tables for molecular properties
    • “mol_id” - Molecule ID (gdb9 index) mapping to the .sdf file
    • “A” - Rotational constant (unit: GHz)
    • “B” - Rotational constant (unit: GHz)
    • “C” - Rotational constant (unit: GHz)
    • “mu” - Dipole moment (unit: D)
    • “alpha” - Isotropic polarizability (unit: Bohr^3)
    • “homo” - Highest occupied molecular orbital energy (unit: Hartree)
    • “lumo” - Lowest unoccupied molecular orbital energy (unit: Hartree)
    • “gap” - Gap between HOMO and LUMO (unit: Hartree)
    • “r2” - Electronic spatial extent (unit: Bohr^2)
    • “zpve” - Zero point vibrational energy (unit: Hartree)
    • “u0” - Internal energy at 0K (unit: Hartree)
    • “u298” - Internal energy at 298.15K (unit: Hartree)
    • “h298” - Enthalpy at 298.15K (unit: Hartree)
    • “g298” - Free energy at 298.15K (unit: Hartree)
    • “cv” - Heat capacity at 298.15K (unit: cal/(mol*K))
    • “u0_atom” - Atomization energy at 0K (unit: kcal/mol)
    • “u298_atom” - Atomization energy at 298.15K (unit: kcal/mol)
    • “h298_atom” - Atomization enthalpy at 298.15K (unit: kcal/mol)
    • “g298_atom” - Atomization free energy at 298.15K (unit: kcal/mol)

“u0_atom” ~ “g298_atom” (used in MoleculeNet) are calculated from the differences between “u0” ~ “g298” and sum of reference energies of all atoms in the molecules, as given in https://figshare.com/articles/Atomref%3A_Reference_thermochemical_energies_of_H%2C_C%2C_N%2C_O%2C_F_atoms./1057643
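
The calculation described above can be sketched as follows: subtract the summed atomic reference energies from the molecular value and convert from Hartree to kcal/mol. The function name and the numeric inputs are placeholders chosen for illustration; real reference energies should be taken from the Atomref table linked above, and the exact conversion constant used to build the dataset is not stated here.

```python
HARTREE_TO_KCAL_PER_MOL = 627.509  # 1 Hartree in kcal/mol (rounded)


def atomization_energy(total_energy, atom_counts, atom_refs):
    """Return (total energy - sum of atomic references) in kcal/mol.

    total_energy: molecular value in Hartree, e.g. the "u0" column.
    atom_counts: element symbol -> number of such atoms in the molecule.
    atom_refs: element symbol -> reference energy in Hartree.
    """
    reference_sum = sum(atom_refs[el] * n for el, n in atom_counts.items())
    return (total_energy - reference_sum) * HARTREE_TO_KCAL_PER_MOL


# Placeholder numbers for illustration only, not real reference energies.
placeholder_refs = {"H": -0.5, "C": -37.8}
u0_atom = atomization_energy(-40.5, {"C": 1, "H": 4}, placeholder_refs)
```

The same function applies to “u298”, “h298”, and “g298”, provided the matching column of reference energies is used for each.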

References

[1]Blum, Lorenz C., and Jean-Louis Reymond. “970 million druglike small molecules for virtual screening in the chemical universe database GDB-13.” Journal of the American Chemical Society 131.25 (2009): 8732-8733.
[2]Ramakrishnan, Raghunathan, et al. “Quantum chemistry structures and properties of 134 kilo molecules.” Scientific data 1 (2014): 140022.

SAMPL Datasets

deepchem.molnet.load_sampl(featurizer='ECFP', split='index', reload=True, move_mean=True, data_dir=None, save_dir=None, **kwargs)[source]

Load SAMPL (FreeSolv) dataset

The Free Solvation Database, FreeSolv (SAMPL), provides experimental and calculated hydration free energies of small molecules in water. The calculated values are derived from alchemical free energy calculations using molecular dynamics simulations. The experimental values are included in the benchmark collection.

Random splitting is recommended for this dataset.

The raw data CSV file contains the columns below:

  • “iupac” - IUPAC name of the compound
  • “smiles” - SMILES representation of the molecular structure
  • “expt” - Measured solvation energy (unit: kcal/mol) of the compound, used as label
  • “calc” - Calculated solvation energy (unit: kcal/mol) of the compound
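
To show how these columns might be consumed outside of DeepChem, the sketch below parses a CSV with the same header and compares “calc” against “expt”. The two rows are made-up placeholders that only mimic the layout, and read_labels and mean_absolute_error are hypothetical helper names.

```python
import csv
import io

# Made-up rows that only mimic the column layout; not real FreeSolv entries.
raw_csv = """iupac,smiles,expt,calc
compound-a,CO,-5.0,-3.5
compound-b,CCO,-5.0,-4.0
"""


def read_labels(handle):
    """Return (smiles, expt, calc) tuples with energies as floats (kcal/mol)."""
    return [
        (row["smiles"], float(row["expt"]), float(row["calc"]))
        for row in csv.DictReader(handle)
    ]


def mean_absolute_error(records):
    """Mean |calc - expt| over all records, in kcal/mol."""
    return sum(abs(calc - expt) for _, expt, calc in records) / len(records)


records = read_labels(io.StringIO(raw_csv))
mae = mean_absolute_error(records)
```

With a real file, io.StringIO(raw_csv) would be replaced by an open file handle.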

References

[1]Mobley, David L., and J. Peter Guthrie. “FreeSolv: a database of experimental and calculated hydration free energies, with input files.” Journal of computer-aided molecular design 28.7 (2014): 711-720.

SIDER Datasets

deepchem.molnet.load_sider(featurizer='ECFP', split='index', reload=True, K=4, data_dir=None, save_dir=None, **kwargs)[source]

Load SIDER dataset

The Side Effect Resource (SIDER) is a database of marketed drugs and adverse drug reactions (ADR). The version of the SIDER dataset in DeepChem has grouped drug side effects into 27 system organ classes following MedDRA classifications measured for 1427 approved drugs.

Random splitting is recommended for this dataset.

The raw data CSV file contains the columns below:

  • “smiles”: SMILES representation of the molecular structure
  • “Hepatobiliary disorders” ~ “Injury, poisoning and procedural complications”: Recorded side effects for the drug. Please refer to http://sideeffects.embl.de/se/?page=98 for details on ADRs.

References

[1]Kuhn, Michael, et al. “The SIDER database of drugs and side effects.” Nucleic acids research 44.D1 (2015): D1075-D1079.
[2]Altae-Tran, Han, et al. “Low data drug discovery with one-shot learning.” ACS central science 3.4 (2017): 283-293.
[3]Medical Dictionary for Regulatory Activities. http://www.meddra.org/

Thermosol Datasets

deepchem.molnet.load_thermosol(featurizer='ECFP', data_dir=None, save_dir=None, split=None, split_seed=None, reload=True)[source]

Loads the thermodynamic solubility datasets.

Tox21 Datasets

deepchem.molnet.load_tox21(featurizer='ECFP', split='index', reload=True, K=4, data_dir=None, save_dir=None, **kwargs)[source]

Load Tox21 dataset

The “Toxicology in the 21st Century” (Tox21) initiative created a public database measuring the toxicity of compounds, which was used in the 2014 Tox21 Data Challenge. This dataset contains qualitative toxicity measurements for 8k compounds on 12 different targets, including nuclear receptors and stress response pathways.

Random splitting is recommended for this dataset.

The raw data CSV file contains the columns below:

  • “smiles” - SMILES representation of the molecular structure
  • “NR-XXX” - Nuclear receptor signaling bioassays results
  • “SR-XXX” - Stress response bioassays results

Please refer to https://tripod.nih.gov/tox21/challenge/data.jsp for details.
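
The “NR-” and “SR-” prefixes make it straightforward to group the assay columns by type. The following sketch does so for a CSV header row. The header shown is an assumption for illustration (including the extra “mol_id” column) and should be checked against the actual file; group_tox21_columns is a hypothetical helper, not a DeepChem function.

```python
def group_tox21_columns(header):
    """Split a CSV header into SMILES, NR assay, SR assay, and other columns."""
    groups = {
        "smiles": [],
        "nuclear_receptor": [],
        "stress_response": [],
        "other": [],
    }
    for name in header:
        if name == "smiles":
            groups["smiles"].append(name)
        elif name.startswith("NR-"):
            groups["nuclear_receptor"].append(name)
        elif name.startswith("SR-"):
            groups["stress_response"].append(name)
        else:
            groups["other"].append(name)
    return groups


# Assumed header for illustration; verify it against the real CSV file.
header = [
    "NR-AR", "NR-AR-LBD", "NR-AhR", "NR-Aromatase", "NR-ER", "NR-ER-LBD",
    "NR-PPAR-gamma", "SR-ARE", "SR-ATAD5", "SR-HSE", "SR-MMP", "SR-p53",
    "mol_id", "smiles",
]
groups = group_tox21_columns(header)
```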

References

[1]Tox21 Challenge. https://tripod.nih.gov/tox21/challenge/

Toxcast Datasets

deepchem.molnet.load_toxcast(featurizer='ECFP', split='index', reload=True, data_dir=None, save_dir=None, **kwargs)[source]

Load Toxcast dataset

ToxCast is an extended data collection from the same initiative as Tox21, providing toxicology data for a large library of compounds based on in vitro high-throughput screening. The processed collection includes qualitative results of over 600 experiments on 8k compounds.

Random splitting is recommended for this dataset.

The raw data csv file contains columns below:

References

[1]Richard, Ann M., et al. “ToxCast chemical landscape: paving the road to 21st century toxicology.” Chemical research in toxicology 29.8 (2016): 1225-1251.

USPTO Datasets

deepchem.molnet.load_uspto(featurizer='plain', split=None, num_to_load=10000, reload=True, verbose=False, data_dir=None, save_dir=None, **kwargs)[source]

Load USPTO dataset.

For now, this function loads only the subset of reactions from 2008-2011. See https://figshare.com/articles/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873 for more details. The full dataset contains some 400K reactions, and featurizing all of it causes an out-of-memory error on a development laptop, so the loader currently returns a truncated subset of the dataset. Reloading is not entirely supported for this dataset.

UV Datasets

deepchem.molnet.load_uv(shard_size=2000, featurizer=None, split=None, reload=True)[source]

Load UV dataset; does not do train/test split

The UV dataset is an in-house dataset from Merck that was first introduced in the following paper: Ramsundar, Bharath, et al. “Is multitask deep learning practical for pharma?.” Journal of chemical information and modeling 57.8 (2017): 2068-2076.

The UV dataset tests 10,000 of Merck’s internal compounds on 190 absorption wavelengths between 210 and 400 nm. Unlike most of the other datasets featured in MoleculeNet, the UV collection does not have structures for the compounds tested since they were proprietary Merck compounds. However, the collection does feature pre-computed descriptors for these compounds.

Note that the original train/valid/test split from the source data was preserved here, so this function doesn’t allow for alternate modes of splitting. Similarly, since the source data came pre-featurized, it is not possible to apply alternative featurizations.

Parameters:
  • shard_size (int, optional) – Size of the DiskDataset shards to write on disk
  • featurizer (optional) – Ignored since featurization pre-computed
  • split (optional) – Ignored since split pre-computed
  • reload (bool, optional) – Whether to automatically re-load from disk

ZINC15 Datasets

deepchem.molnet.load_zinc15(featurizer=<class 'deepchem.feat.molecule_featurizers.one_hot_featurizer.OneHotFeaturizer'>, transformers: List[T] = [<class 'deepchem.trans.transformers.NormalizationTransformer'>], splitter=<class 'deepchem.splits.splitters.RandomSplitter'>, reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, featurizer_kwargs: Dict[str, object] = {}, splitter_kwargs: Dict[str, object] = {}, transformer_kwargs: Dict[str, Dict[str, object]] = {'NormalizationTransformer': {'transform_X': True}}, dataset_size: str = '250K', dataset_dimension: str = '2D', test_run: bool = False) → Tuple[List[T], Optional[Tuple], List[T]][source]

Load zinc15.

ZINC15 is a dataset of over 230 million purchasable compounds for virtual screening of small molecules to identify structures that are likely to bind to drug targets. ZINC15 data is currently available in 2D (SMILES string) format.

MolNet provides subsets of 250K, 1M, and 10M “lead-like” compounds from ZINC15. The full dataset of 270M “goldilocks” compounds is also available. Compounds in ZINC15 are labeled by their molecular weight and LogP (solubility) values. Each compound also has information about how readily available (purchasable) it is and its reactivity. Lead-like compounds have molecular weight between 300 and 350 Daltons and LogP between -1 and 3.5. Goldilocks compounds are lead-like compounds with LogP values further restricted to between 2 and 3.
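
The lead-like and goldilocks definitions above amount to simple range checks, sketched below. Treating the bounds as inclusive is an assumption, and zinc15_category is a hypothetical helper rather than part of DeepChem or ZINC15.

```python
def zinc15_category(mol_weight, log_p):
    """Classify a compound using the ranges described above.

    mol_weight is in Daltons. Bounds are treated as inclusive (an assumption).
    """
    is_lead_like = 300 <= mol_weight <= 350 and -1 <= log_p <= 3.5
    if not is_lead_like:
        return "other"
    # Goldilocks compounds are lead-like with a narrower LogP window.
    if 2 <= log_p <= 3:
        return "goldilocks"
    return "lead-like"
```

For instance, a compound with molecular weight 325 and LogP 2.5 falls in the goldilocks range, while the same weight with LogP 0.0 is lead-like only.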

If reload = True and data_dir (save_dir) is specified, the loader will attempt to load the raw dataset (featurized dataset) from disk. Otherwise, the dataset will be downloaded from the DeepChem AWS bucket.

For more information on ZINC15, please see [1] and https://zinc15.docking.org/.

Parameters:
  • size (str (default '250K')) – Size of dataset to download. Currently only ‘250K’ is supported.
  • format (str (default '2D')) – Format of data to download. 2D SMILES strings or 3D SDF files.
  • featurizer (allowed featurizers for this dataset) – A featurizer that inherits from deepchem.feat.Featurizer.
  • transformers (List of allowed transformers for this dataset) – A transformer that inherits from deepchem.trans.Transformer.
  • splitter (allowed splitters for this dataset) – A splitter that inherits from deepchem.splits.splitters.Splitter.
  • reload (bool (default True)) – Try to reload dataset from disk if already downloaded. Save to disk after featurizing.
  • data_dir (str, optional (default None)) – Path to datasets.
  • save_dir (str, optional (default None)) – Path to featurized datasets.
  • featurizer_kwargs (dict) – Specify parameters to featurizer, e.g. {“size”: 1024}
  • splitter_kwargs (dict) – Specify parameters to splitter, e.g. {“seed”: 42}
  • transformer_kwargs (dict) – Maps transformer names to constructor arguments, e.g. {“BalancingTransformer”: {“transform_x”:True, “transform_y”:False}}
  • dataset_size (str (default '250K')) – Number of compounds to download; ‘250K’, ‘1M’, ‘10M’, or ‘270M’.
  • dataset_dimension (str (default '2D')) – SMILES strings (2D) or 3D SDF files; ‘2D’ or ‘3D’
  • test_run (bool (default False)) – Flag to indicate tests, if True dataset is not downloaded.

Returns:
  • tasks (list) – Column names corresponding to machine learning target variables.
  • datasets (tuple) – train, validation, test splits of data as deepchem.data.datasets.Dataset instances.
  • transformers (list) – deepchem.trans.transformers.Transformer instances applied to dataset.

Return type:
  tuple

Notes

The total ZINC dataset with SMILES strings contains hundreds of millions of compounds and is over 100GB! ZINC250K is recommended for experimentation. The full set of 270M goldilocks compounds is 23GB.

References

[1]Sterling and Irwin. J. Chem. Inf. Model, 2015 http://pubs.acs.org/doi/abs/10.1021/acs.jcim.5b00559.

Examples

>>> import deepchem as dc
>>> tasks, datasets, transformers = dc.molnet.load_zinc15(test_run=True)
>>> train_dataset, val_dataset, test_dataset = datasets
>>> n_tasks = len(tasks)
>>> n_features = train_dataset.X.shape[1]
>>> model = dc.models.MultitaskRegressor(n_tasks, n_features)