MoleculeNet

The DeepChem library is packaged alongside the MoleculeNet suite of datasets. One of the most important parts of machine learning applications is finding a suitable dataset. The MoleculeNet suite has curated a whole range of datasets and loaded them into DeepChem dc.data.Dataset objects for convenience.
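As a minimal sketch of how the loaders on this page are typically consumed, the helper below unpacks a loader's return value and reports the dataset sizes. It assumes the loader returns a (tasks, datasets, transformers) tuple in which datasets is a (train, valid, test) triple, which is the usual behaviour when a splitter is selected. The names summarize_loader and fake_loader are hypothetical; fake_loader only stands in for a real loader so the snippet runs without downloading data.

```python
def summarize_loader(loader, **kwargs):
    """Call a MoleculeNet-style loader and summarize what it returns."""
    # Assumed return shape: (task names, (train, valid, test), transformers).
    tasks, (train, valid, test), transformers = loader(**kwargs)
    return {
        "n_tasks": len(tasks),
        "n_train": len(train),
        "n_valid": len(valid),
        "n_test": len(test),
        "n_transformers": len(transformers),
    }


def fake_loader(featurizer="ECFP", split="index", **kwargs):
    """Hypothetical stand-in for a real loader; returns toy lists."""
    samples = list(range(10))
    return ["task_a"], (samples[:8], samples[8:9], samples[9:]), []


summary = summarize_loader(fake_loader, featurizer="ECFP", split="index")
print(summary)
```

With DeepChem installed, the same helper could be pointed at a real loader, for example summarize_loader(deepchem.molnet.load_delaney, featurizer='ECFP', split='index').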

Contributing a new dataset to MoleculeNet

If you are proposing a new dataset to be included in the MoleculeNet benchmarking suite, please follow the instructions below. Please review the datasets already available in MolNet before contributing.

  1. Read the Contribution guidelines.
  2. Open an issue to discuss the dataset you want to add to MolNet.
  3. Implement a function in the deepchem.molnet.load_function module following the template function deepchem.molnet.load_function.load_dataset_template. Specify which featurizers, transformers, and splitters (available from deepchem.molnet.defaults) are supported for your dataset.
  4. Add your load function to deepchem.molnet.__init__.py for easy importing.
  5. Prepare your dataset as a .tar.gz or .zip file. Accepted filetypes include CSV, JSON, and SDF.
  6. Ask a member of the technical steering committee to add your .tar.gz or .zip file to the DeepChem AWS bucket. Modify your load function to pull down the dataset from AWS.
  7. Submit a [WIP] PR (Work in progress pull request) following the PR template.
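
Step 3 can be pictured with the skeleton below. It is a hedged sketch only, not the actual deepchem.molnet.load_function.load_dataset_template: the function name load_mydataset and the two option tuples are hypothetical, and a real loader would take its supported options from deepchem.molnet.defaults and return datasets rather than a dictionary.

```python
# Hypothetical option sets; a real loader would draw these from
# deepchem.molnet.defaults as described in step 3.
SUPPORTED_FEATURIZERS = ("ECFP", "GraphConv")
SUPPORTED_SPLITTERS = ("index", "random", "scaffold")


def load_mydataset(featurizer="ECFP", split="random", reload=True,
                   data_dir=None, save_dir=None, **kwargs):
    """Skeleton loader: validate the options, then report the plan."""
    if featurizer not in SUPPORTED_FEATURIZERS:
        raise ValueError(f"unsupported featurizer: {featurizer!r}")
    if split is not None and split not in SUPPORTED_SPLITTERS:
        raise ValueError(f"unsupported splitter: {split!r}")
    # A real implementation would download the archive, featurize the
    # molecules, split them, and return the resulting datasets here.
    return {"featurizer": featurizer, "split": split, "reload": reload,
            "data_dir": data_dir, "save_dir": save_dir}


plan = load_mydataset(featurizer="ECFP", split="scaffold")
print(plan)
```

The signature mirrors the keyword arguments shared by most loaders on this page, so callers can switch between datasets without changing their call sites.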

BACE Dataset

deepchem.molnet.load_bace_classification(featurizer='ECFP', split='random', reload=True, data_dir=None, save_dir=None, **kwargs)

Load bace datasets.

deepchem.molnet.load_bace_regression(featurizer='ECFP', split='random', reload=True, move_mean=True, data_dir=None, save_dir=None, **kwargs)

Load bace datasets.

BBBC Datasets

deepchem.molnet.load_bbbc001(split='index', reload=True, data_dir=None, save_dir=None, **kwargs)

Load BBBC001 dataset

This dataset contains 6 images of human HT29 colon cancer cells. The task is to learn to predict the cell counts in these images. This dataset is too small to train algorithms on, but might serve as a good test dataset. https://data.broadinstitute.org/bbbc/BBBC001/

deepchem.molnet.load_bbbc002(split='index', reload=True, data_dir=None, save_dir=None, **kwargs)

Load BBBC002 dataset

This dataset contains data corresponding to 5 samples of Drosophila Kc167 cells. There are 10 fields of view for each sample, each an image of size 512x512. Ground truth labels contain cell counts for this dataset. Full details about this dataset are available at https://data.broadinstitute.org/bbbc/BBBC002/.

BBBP Datasets

BBBP stands for Blood-Brain-Barrier Penetration.

deepchem.molnet.load_bbbp(featurizer='ECFP', split='random', reload=True, data_dir=None, save_dir=None, **kwargs)

Load blood-brain barrier penetration datasets

Cell Counting Datasets

deepchem.molnet.load_cell_counting(split=None, reload=True, data_dir=None, save_dir=None, **kwargs)

Load Cell Counting dataset.

Loads the cell counting dataset from http://www.robots.ox.ac.uk/~vgg/research/counting/index_org.html.

Chembl Datasets

deepchem.molnet.load_chembl(shard_size=2000, featurizer='ECFP', set='5thresh', split='random', reload=True, data_dir=None, save_dir=None, **kwargs)

Chembl25 Datasets

deepchem.molnet.load_chembl25(featurizer='smiles2seq', split='random', data_dir=None, save_dir=None, split_seed=None, reload=True, transformer_type='minmax', **kwargs)

Loads the ChEMBL25 dataset, featurizes it, and does a split.

Parameters:
  • featurizer (str, default smiles2seq) – Featurizer to use
  • split (str, default random) – Splitter to use
  • data_dir (str, default None) – Directory to download data to, or load dataset from. (TODO: If None, make tmp)
  • save_dir (str, default None) – Directory to save the featurized dataset to. (TODO: If None, make tmp)
  • split_seed (int, default None) – Seed to be used for splitting the dataset
  • reload (bool, default True) – Whether to reload saved dataset
  • transformer_type (str, default minmax) – Transformer to use

Clearance Datasets

deepchem.molnet.load_clearance(featurizer='ECFP', split='random', reload=True, move_mean=True, data_dir=None, save_dir=None, **kwargs)

Load clearance datasets.

Clintox Datasets

deepchem.molnet.load_clintox(featurizer='ECFP', split='index', reload=True, data_dir=None, save_dir=None, **kwargs)

Load clintox datasets.

The ClinTox dataset compares drugs approved by the FDA and drugs that have failed clinical trials for toxicity reasons. The dataset includes two classification tasks for 1491 drug compounds with known chemical structures: (1) clinical trial toxicity (or absence of toxicity) and (2) FDA approval status. The list of FDA-approved drugs is compiled from the SWEETLEAD database, and the list of drugs that failed clinical trials for toxicity reasons is compiled from the Aggregate Analysis of ClinicalTrials.gov (AACT) database.

The data file contains a CSV table, in which the columns below are used:
  • “smiles” – SMILES representation of the molecular structure
  • “FDA_APPROVED” – FDA approval status
  • “CT_TOX” – Clinical trial results
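
The column layout described above can be read with the standard library, as in the sketch below. The helper name read_clintox_rows is hypothetical, and the sample rows are invented for illustration only; they are not taken from the ClinTox data file.

```python
import csv
import io

# Illustrative rows only; the SMILES strings and labels are made up for the
# example and are not taken from the ClinTox data file.
SAMPLE_CSV = """smiles,FDA_APPROVED,CT_TOX
CCO,1,0
c1ccccc1,0,1
CC(=O)O,1,0
"""


def read_clintox_rows(handle):
    """Return (smiles, [FDA_APPROVED, CT_TOX]) pairs from a ClinTox-style CSV."""
    reader = csv.DictReader(handle)
    rows = []
    for record in reader:
        labels = [int(record["FDA_APPROVED"]), int(record["CT_TOX"])]
        rows.append((record["smiles"], labels))
    return rows


rows = read_clintox_rows(io.StringIO(SAMPLE_CSV))
print(rows)
```

Each molecule carries two labels, one per classification task, which is why the loader treats ClinTox as a multitask dataset.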

References

[1]Gayvert, Kaitlyn M., Neel S. Madhukar, and Olivier Elemento. “A data-driven approach to predicting successes and failures of clinical trials.” Cell chemical biology 23.10 (2016): 1294-1301.
[2]Artemov, Artem V., et al. “Integrated deep learned transcriptomic and structure-based predictor of clinical trials outcomes.” bioRxiv (2016): 095653.
[3]Novick, Paul A., et al. “SWEETLEAD: an in silico database of approved drugs, regulated chemicals, and herbal isolates for computer-aided drug discovery.” PloS one 8.11 (2013): e79568.
[4]Aggregate Analysis of ClinicalTrials.gov (AACT) Database. https://www.ctti-clinicaltrials.org/aact-database

Delaney Datasets

deepchem.molnet.load_delaney(featurizer='ECFP', split='index', reload=True, move_mean=True, data_dir=None, save_dir=None, **kwargs)

Load delaney datasets.

The Delaney datasets are extracted from the following paper

Delaney, John S. “ESOL: estimating aqueous solubility directly from molecular structure.” Journal of chemical information and computer sciences 44.3 (2004): 1000-1005.

This dataset contains 2874 measured aqueous solubility values. The source dataset is available in the supplemental material of the original paper.

Factors Datasets

deepchem.molnet.load_factors(shard_size=2000, featurizer=None, split=None, reload=True)

Loads FACTOR dataset; does not do train/test split

The Factors dataset is an in-house dataset from Merck that was first introduced in the following paper: Ramsundar, Bharath, et al. “Is multitask deep learning practical for pharma?.” Journal of chemical information and modeling 57.8 (2017): 2068-2076.

It contains 1500 Merck in-house compounds that were measured for IC50 of inhibition on 12 serine proteases. Unlike most of the other datasets featured in MoleculeNet, the Factors collection does not have structures for the compounds tested since they were proprietary Merck compounds. However, the collection does feature pre-computed descriptors for these compounds.

Note that the original train/valid/test split from the source data was preserved here, so this function doesn’t allow for alternate modes of splitting. Similarly, since the source data came pre-featurized, it is not possible to apply alternative featurizations.

Parameters:
  • shard_size (int, optional) – Size of the DiskDataset shards to write on disk
  • featurizer (optional) – Ignored since featurization pre-computed
  • split (optional) – Ignored since split pre-computed
  • reload (bool, optional) – Whether to automatically re-load from disk

HIV Datasets

deepchem.molnet.load_hiv(featurizer='ECFP', split='index', reload=True, data_dir=None, save_dir=None, **kwargs)

Load hiv datasets. Does not do train/test split

The HIV dataset was introduced by the Developmental Therapeutics Program (DTP) AIDS Antiviral Screen, which tested the ability to inhibit HIV replication for over 40,000 compounds. Screening results were evaluated and placed into three categories: confirmed inactive (CI), confirmed active (CA), and confirmed moderately active (CM). The latter two labels are combined, making this a classification task between inactive (CI) and active (CA and CM).

The data file contains a CSV table, in which the columns below are used:
  • “smiles” – SMILES representation of the molecular structure
  • “activity” – Three-class labels for screening results: CI/CM/CA
  • “HIV_active” – Binary labels for screening results: 1 (CA/CM) and 0 (CI)
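
The relationship between the “activity” and “HIV_active” columns can be written out as a small mapping. This is a sketch of the label merge described above, not code from the loader; the function name hiv_active is hypothetical.

```python
def hiv_active(activity):
    """Map a three-class screening label to the binary HIV_active label."""
    # CA and CM are merged into the active class; CI stays inactive.
    mapping = {"CI": 0, "CM": 1, "CA": 1}
    try:
        return mapping[activity]
    except KeyError:
        raise ValueError(f"unknown activity label: {activity!r}")


labels = [hiv_active(a) for a in ["CI", "CA", "CM", "CI"]]
print(labels)
```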

References

[1]AIDS Antiviral Screen Data. https://wiki.nci.nih.gov/display/NCIDTPdata/AIDS+Antiviral+Screen+Data

HOPV Datasets

HOPV stands for the Harvard Organic Photovoltaic Dataset.

deepchem.molnet.load_hopv(featurizer='ECFP', split='index', reload=True, data_dir=None, save_dir=None, **kwargs)

Load HOPV datasets. Does not do train/test split

The HOPV datasets consist of the “Harvard Organic Photovoltaic Dataset”. This dataset includes 350 small molecules and polymers that were utilized as p-type materials in OPVs. Experimental properties include: HOMO [a.u.], LUMO [a.u.], Electrochemical gap [a.u.], Optical gap [a.u.], Power conversion efficiency [%], Open circuit potential [V], Short circuit current density [mA/cm^2], and fill factor [%]. Theoretical calculations in the original dataset have been removed (for now).

Lopez, Steven A., et al. “The Harvard organic photovoltaic dataset.” Scientific data 3.1 (2016): 1-7.

HPPB Datasets

deepchem.molnet.load_hppb(featurizer='ECFP', data_dir=None, save_dir=None, split=None, split_seed=None, reload=True, **kwargs)

Loads the thermodynamic solubility datasets.

KAGGLE Datasets

deepchem.molnet.load_kaggle(shard_size=2000, featurizer=None, split=None, reload=True)

Loads kaggle datasets. Generates if not stored already.

The Kaggle dataset is an in-house dataset from Merck that was first introduced in the following paper:

Ma, Junshui, et al. “Deep neural nets as a method for quantitative structure–activity relationships.” Journal of chemical information and modeling 55.2 (2015): 263-274.

It contains 100,000 unique Merck in-house compounds that were measured on 15 enzyme inhibition and ADME/TOX datasets. Unlike most of the other datasets featured in MoleculeNet, the Kaggle collection does not have structures for the compounds tested since they were proprietary Merck compounds. However, the collection does feature pre-computed descriptors for these compounds.

Note that the original train/valid/test split from the source data was preserved here, so this function doesn’t allow for alternate modes of splitting. Similarly, since the source data came pre-featurized, it is not possible to apply alternative featurizations.

Parameters:
  • shard_size (int, optional) – Size of the DiskDataset shards to write on disk
  • featurizer (optional) – Ignored since featurization pre-computed
  • split (optional) – Ignored since split pre-computed
  • reload (bool, optional) – Whether to automatically re-load from disk

Kinase Datasets

deepchem.molnet.load_kinase(shard_size=2000, featurizer=None, split=None, reload=True)

Loads Kinase datasets, does not do train/test split

The Kinase dataset is an in-house dataset from Merck that was first introduced in the following paper: Ramsundar, Bharath, et al. “Is multitask deep learning practical for pharma?.” Journal of chemical information and modeling 57.8 (2017): 2068-2076.

It contains 2500 Merck in-house compounds that were measured for IC50 of inhibition on 99 protein kinases. Unlike most of the other datasets featured in MoleculeNet, the Kinase collection does not have structures for the compounds tested since they were proprietary Merck compounds. However, the collection does feature pre-computed descriptors for these compounds.

Note that the original train/valid/test split from the source data was preserved here, so this function doesn’t allow for alternate modes of splitting. Similarly, since the source data came pre-featurized, it is not possible to apply alternative featurizations.

Parameters:
  • shard_size (int, optional) – Size of the DiskDataset shards to write on disk
  • featurizer (optional) – Ignored since featurization pre-computed
  • split (optional) – Ignored since split pre-computed
  • reload (bool, optional) – Whether to automatically re-load from disk

Lipo Datasets

deepchem.molnet.load_lipo(featurizer='ECFP', split='index', reload=True, move_mean=True, data_dir=None, save_dir=None, **kwargs)

Load Lipophilicity datasets.

MUV Datasets

deepchem.molnet.load_muv(featurizer='ECFP', split='index', reload=True, K=4, data_dir=None, save_dir=None, **kwargs)

Load MUV datasets. Does not do train/test split

NCI Datasets

deepchem.molnet.load_nci(featurizer='ECFP', shard_size=1000, split='random', reload=True, data_dir=None, save_dir=None, **kwargs)

PCBA Datasets

deepchem.molnet.load_pcba(featurizer='ECFP', split='random', reload=True, data_dir=None, save_dir=None, **kwargs)

PDBBIND Datasets

deepchem.molnet.load_pdbbind(reload=True, data_dir=None, subset='core', load_binding_pocket=False, featurizer='grid', split='random', split_seed=None, save_dir=None, save_timestamp=False)

Load the raw PDBBind dataset, then featurize and split it.

Parameters:
  • reload (bool, optional) – Whether to reload the saved featurized and split dataset.
  • data_dir (str, optional) – Specifies the directory storing the raw dataset.
  • load_binding_pocket (bool, optional) – Load the binding pocket or the full protein.
  • subset (str) – Specifies which subset of PDBBind to load; only “core” or “refined” for now.
  • featurizer (str) – Either “grid” or “atomic” for grid and atomic featurizations.
  • split (str) – Either “random” or “index”.
  • split_seed (int, optional) – Specifies the random seed for the splitter.
  • save_dir (str, optional) – Specifies the directory in which to store the featurized and split dataset when reload is False. If reload is True, the saved dataset inside save_dir is loaded.
  • save_timestamp (bool, optional) – Whether to save the featurized and split dataset with a timestamp. Set it to True when running similar or identical jobs simultaneously on multiple compute nodes.

PPB Datasets

deepchem.molnet.load_ppb(featurizer='ECFP', split='index', reload=True, data_dir=None, save_dir=None, **kwargs)

Load PPB datasets.

QM7 Datasets

deepchem.molnet.load_qm7(featurizer='CoulombMatrix', split='random', reload=True, move_mean=True, data_dir=None, save_dir=None, **kwargs)

Load qm7 datasets.

QM7 is a subset of GDB-13 (a database of nearly 1 billion stable and synthetically accessible organic molecules) containing molecules with up to 7 heavy atoms (C, N, O, and S). The 3D Cartesian coordinates of the most stable conformations and their atomization energies were determined using ab-initio density functional theory (PBE0/tier2 basis set). This dataset also provides Coulomb matrices as calculated in [Rupp et al. PRL, 2012]:

  C_ii = 0.5 * Z_i^2.4
  C_ij = Z_i * Z_j / abs(R_i − R_j)

where Z_i is the nuclear charge of atom i and R_i is the Cartesian coordinates of atom i.

The data file (.mat format; Python users can load the original data with scipy.io.loadmat) contains five arrays:
  • “X” – (7165 x 23 x 23), Coulomb matrices
  • “T” – (7165), atomization energies (unit: kcal/mol)
  • “P” – (5 x 1433), cross-validation splits as used in [Montavon et al. NIPS, 2012]
  • “Z” – (7165 x 23), atomic charges
  • “R” – (7165 x 23 x 3), Cartesian coordinates (unit: Bohr) of each atom in the molecules
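
The Coulomb matrix definition above translates directly into code. The sketch below is a plain-Python illustration of the two formulas rather than the featurizer used by DeepChem; the function name coulomb_matrix is hypothetical and the input is a toy two-atom system, not a QM7 molecule.

```python
import math


def coulomb_matrix(charges, coords):
    """Build the Coulomb matrix from nuclear charges Z and coordinates R."""
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal term: C_ii = 0.5 * Z_i^2.4
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # Off-diagonal term: C_ij = Z_i * Z_j / |R_i - R_j|
                distance = math.dist(coords[i], coords[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


# Toy input: two unit charges separated by 2 length units.
example = coulomb_matrix([1.0, 1.0], [(0.0, 0.0, 0.0), (0.0, 0.0, 2.0)])
print(example)
```

The matrix is symmetric because the off-diagonal term depends only on the product of the charges and the distance between the two atoms.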

Reference:
Rupp, Matthias, et al. “Fast and accurate modeling of molecular atomization energies with machine learning.” Physical review letters 108.5 (2012): 058301.
Montavon, Grégoire, et al. “Learning invariant representations of molecules for atomization energy prediction.” Advances in Neural Information Processing Systems. 2012.

deepchem.molnet.load_qm7_from_mat(featurizer='CoulombMatrix', split='stratified', reload=True, move_mean=True, data_dir=None, save_dir=None, **kwargs)

deepchem.molnet.load_qm7b_from_mat(featurizer='CoulombMatrix', split='stratified', reload=True, move_mean=True, data_dir=None, save_dir=None, **kwargs)

Load QM7B dataset

QM7b is an extension for the QM7 dataset with additional properties predicted at different levels (ZINDO, SCS, PBE0, GW). In total 14 tasks are included for 7211 molecules with up to 7 heavy atoms.

The dataset in .mat format (for Python users, we recommend using scipy.io.loadmat) includes two arrays:
  • “X” – (7211 x 23 x 23), Coulomb matrices
  • “T” – (7211 x 14), properties

The 14 properties are:
  • Atomization energies E (PBE0, unit: kcal/mol)
  • Excitation of maximal optimal absorption E_max (ZINDO, unit: eV)
  • Absorption Intensity at maximal absorption I_max (ZINDO)
  • Highest occupied molecular orbital HOMO (ZINDO, unit: eV)
  • Lowest unoccupied molecular orbital LUMO (ZINDO, unit: eV)
  • First excitation energy E_1st (ZINDO, unit: eV)
  • Ionization potential IP (ZINDO, unit: eV)
  • Electron affinity EA (ZINDO, unit: eV)
  • Highest occupied molecular orbital HOMO (PBE0, unit: eV)
  • Lowest unoccupied molecular orbital LUMO (PBE0, unit: eV)
  • Highest occupied molecular orbital HOMO (GW, unit: eV)
  • Lowest unoccupied molecular orbital LUMO (GW, unit: eV)
  • Polarizabilities α (PBE0, unit: Å^3)
  • Polarizabilities α (SCS, unit: Å^3)

References

[1]Blum, Lorenz C., and Jean-Louis Reymond. “970 million druglike small molecules for virtual screening in the chemical universe database GDB-13.” Journal of the American Chemical Society 131.25 (2009): 8732-8733.
[2]Montavon, Grégoire, et al. “Machine learning of molecular electronic properties in chemical compound space.” New Journal of Physics 15.9 (2013): 095003.

QM8 Datasets

deepchem.molnet.load_qm8(featurizer='CoulombMatrix', split='random', reload=True, move_mean=True, data_dir=None, save_dir=None, **kwargs)

Load QM8 Datasets

The QM8 is the dataset used in a study on modeling quantum mechanical calculations of electronic spectra and excited state energy of small molecules. Multiple methods, including time-dependent density functional theories (TDDFT) and second-order approximate coupled-cluster (CC2), are applied to a collection of molecules that include up to eight heavy atoms (also a subset of the GDB-17 database). In our collection, there are four excited state properties calculated by four different methods on 22 thousand samples:

  • S_0 -> S_1 transition energy E_1 and the corresponding oscillator strength f_1
  • S_0 -> S_2 transition energy E_2 and the corresponding oscillator strength f_2

The source data files (downloadable from moleculenet.ai):
  • qm8.sdf: molecular structures
  • qm8.sdf.csv: tables for molecular properties
    • Column 1: Molecule ID (gdb9 index) mapping to the .sdf file
    • Columns 2-5: RI-CC2/def2TZVP; E1, E2, f1, f2 in atomic units. f1, f2 in length representation
    • Columns 6-9: LR-TDPBE0/def2SVP; E1, E2, f1, f2 in atomic units. f1, f2 in length representation
    • Columns 10-13: LR-TDPBE0/def2TZVP; E1, E2, f1, f2 in atomic units. f1, f2 in length representation
    • Columns 14-17: LR-TDCAM-B3LYP/def2TZVP; E1, E2, f1, f2 in atomic units. f1, f2 in length representation
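
The 17-column layout of qm8.sdf.csv can be regrouped by method as sketched below. The function name group_qm8_row is hypothetical, the molecule ID is kept as a string because its exact format is not stated here, and the row uses placeholder values 1 to 16 rather than real property values.

```python
METHODS = [
    "RI-CC2/def2TZVP",
    "LR-TDPBE0/def2SVP",
    "LR-TDPBE0/def2TZVP",
    "LR-TDCAM-B3LYP/def2TZVP",
]
PROPERTIES = ["E1", "E2", "f1", "f2"]


def group_qm8_row(row):
    """Split one 17-column row into a molecule ID and per-method properties."""
    if len(row) != 1 + len(METHODS) * len(PROPERTIES):
        raise ValueError("expected 17 columns")
    molecule_id = row[0]
    grouped = {}
    for m, method in enumerate(METHODS):
        # Each method owns four consecutive columns after the ID column.
        start = 1 + m * len(PROPERTIES)
        values = [float(v) for v in row[start:start + len(PROPERTIES)]]
        grouped[method] = dict(zip(PROPERTIES, values))
    return molecule_id, grouped


# Placeholder values 1..16 stand in for real property values.
row = ["42"] + [str(v) for v in range(1, 17)]
molecule_id, grouped = group_qm8_row(row)
print(molecule_id, grouped["LR-TDPBE0/def2SVP"])
```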

References

[1]Blum, Lorenz C., and Jean-Louis Reymond. “970 million druglike small molecules for virtual screening in the chemical universe database GDB-13.” Journal of the American Chemical Society 131.25 (2009): 8732-8733.
[2]Ramakrishnan, Raghunathan, et al. “Electronic spectra from TDDFT and machine learning in chemical space.” The Journal of chemical physics 143.8 (2015): 084111.

QM9 Datasets

deepchem.molnet.load_qm9(featurizer='CoulombMatrix', split='random', reload=True, move_mean=True, data_dir=None, save_dir=None, **kwargs)

Load qm9 datasets.

SAMPL Datasets

deepchem.molnet.load_sampl(featurizer='ECFP', split='index', reload=True, move_mean=True, data_dir=None, save_dir=None, **kwargs)

Load SAMPL datasets.

SIDER Datasets

deepchem.molnet.load_sider(featurizer='ECFP', split='index', reload=True, K=4, data_dir=None, save_dir=None, **kwargs)

Load SIDER datasets

The Side Effect Resource (SIDER) is a database of marketed drugs and adverse drug reactions (ADR). The version of the SIDER dataset in DeepChem has grouped drug side effects into 27 system organ classes following MedDRA classifications measured for 1427 approved drugs.

The data file contains a CSV table, in which the columns below are used:
  • “smiles” – SMILES representation of the molecular structure
  • “Hepatobiliary disorders” ~ “Injury, poisoning and procedural complications” – Recorded side effects for the drug

Please refer to http://sideeffects.embl.de/se/?page=98 for details on ADRs.

References

[1]Kuhn, Michael, et al. “The SIDER database of drugs and side effects.” Nucleic acids research 44.D1 (2015): D1075-D1079.
[2]Altae-Tran, Han, et al. “Low data drug discovery with one-shot learning.” ACS central science 3.4 (2017): 283-293.
[3]Medical Dictionary for Regulatory Activities. http://www.meddra.org/

Thermosol Datasets

deepchem.molnet.load_thermosol(featurizer='ECFP', data_dir=None, save_dir=None, split=None, split_seed=None, reload=True)

Loads the thermodynamic solubility datasets.

Tox21 Datasets

deepchem.molnet.load_tox21(featurizer='ECFP', split='index', reload=True, K=4, data_dir=None, save_dir=None, **kwargs)

Load Tox21 datasets. Does not do train/test split

Toxcast Datasets

deepchem.molnet.load_toxcast(featurizer='ECFP', split='index', reload=True, data_dir=None, save_dir=None, **kwargs)

Loads the Toxcast datasets.

ToxCast is an extended data collection from the same initiative as Tox21, providing toxicology data for a large library of compounds based on in vitro high-throughput screening. The processed collection includes qualitative results of over 600 experiments on 8k compounds.

The source data file contains a csv table, in which columns below are used:

Richard, Ann M., et al. “ToxCast chemical landscape: paving the road to 21st century toxicology.” Chemical research in toxicology 29.8 (2016): 1225-1251.

USPTO Datasets

deepchem.molnet.load_uspto(featurizer='plain', split=None, num_to_load=10000, reload=True, verbose=False, data_dir=None, save_dir=None, **kwargs)

Load USPTO dataset.

For now, this function loads only the subset of data for 2008-2011 reactions. See https://figshare.com/articles/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873 for more details. The full dataset contains some 400K reactions, and featurizing all of it causes an out-of-memory error on a development laptop, so for now the function returns a truncated subset of the dataset. Reloading is not entirely supported for this dataset.
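
One common way to cap memory use, in the spirit of the num_to_load argument, is to stop reading after a fixed number of records. The sketch below shows that idea with the standard library; it is not necessarily how load_uspto is implemented, and load_first and fake_reactions are hypothetical names.

```python
from itertools import islice


def load_first(records, num_to_load=10000):
    """Materialize at most num_to_load items from an iterable of records."""
    # islice stops pulling from the source once the limit is reached, so the
    # remaining records are never held in memory.
    return list(islice(records, num_to_load))


def fake_reactions():
    """Hypothetical endless record stream standing in for a large file."""
    index = 0
    while True:
        yield f"reaction-{index}"
        index += 1


subset = load_first(fake_reactions(), num_to_load=3)
print(subset)
```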

UV Datasets

deepchem.molnet.load_uv(shard_size=2000, featurizer=None, split=None, reload=True)

Load UV dataset; does not do train/test split

The UV dataset is an in-house dataset from Merck that was first introduced in the following paper: Ramsundar, Bharath, et al. “Is multitask deep learning practical for pharma?.” Journal of chemical information and modeling 57.8 (2017): 2068-2076.

The UV dataset tests 10,000 of Merck’s internal compounds on 190 absorption wavelengths between 210 and 400 nm. Unlike most of the other datasets featured in MoleculeNet, the UV collection does not have structures for the compounds tested since they were proprietary Merck compounds. However, the collection does feature pre-computed descriptors for these compounds.

Note that the original train/valid/test split from the source data was preserved here, so this function doesn’t allow for alternate modes of splitting. Similarly, since the source data came pre-featurized, it is not possible to apply alternative featurizations.

Parameters:
  • shard_size (int, optional) – Size of the DiskDataset shards to write on disk
  • featurizer (optional) – Ignored since featurization pre-computed
  • split (optional) – Ignored since split pre-computed
  • reload (bool, optional) – Whether to automatically re-load from disk