MoleculeNet¶
The DeepChem library is packaged alongside the MoleculeNet suite of datasets.
One of the most important parts of machine learning applications is finding a suitable dataset.
The MoleculeNet suite has curated a whole range of datasets and loaded them into DeepChem dc.data.Dataset objects for convenience.
MoleculeNet Cheatsheet¶
When training a model or running a benchmark, the user needs specific datasets. At the beginning, however, the search for them can be exhausting and confusing. The following cheatsheet aims to help DeepChem users identify more easily which dataset to use for their purposes.
Each row represents a dataset and gives a brief description of it. The columns give the type of data (molecule properties, images, or materials) and the number of data points. Each dataset is referenced with a link to its paper. Finally, some entries still need further information.
Cheatsheet
Name | Description | Type | Data Points | Reference |
---|---|---|---|---|
BACE (Regression) | Provides binding results for a set of inhibitors of human beta-secretase (BACE-1) | Molecules | 1513 | |
BACE (Classification) | Provides binding results for a set of inhibitors of human beta-secretase (BACE-1) | Molecules | 1513 | |
BBBC (BBBC001) | Images of HT29 colon cancer cells | Images | 6 | |
BBBC (BBBC002) | Images of Drosophila Kc167 cells | Images | 50 | |
BBBP | Blood-Brain Barrier Penetration designed for the modeling and prediction of barrier permeability | Binary labels on permeability properties | 2000 | |
Cell Counting | Synthetic emulations of fluorescence microscopic images of bacterial cells | Images | 200 | |
ChEMBL (set = ‘sparse’) | A sparse subset of ChEMBL with activity data for one target | Molecules | 244 245 | |
ChEMBL (set = ‘5thresh’) | A subset of ChEMBL with activity data for at least five targets | Molecules | 23 871 | |
ChEMBL25 | | Molecules | | |
Clearance | | | | |
Clintox | Compares drugs approved by the FDA and drugs that have failed clinical trials for toxicity reasons | Molecules | 1491 | |
Delaney | A regression dataset containing structures and water solubility data | Molecules | 1128 | |
Factors | Merck in-house compounds that were measured for IC50 of inhibition on 12 serine proteases | Molecules | 1500 | |
Freesolv | A collection of experimental and calculated hydration free energies for small molecules in water | Molecules | 643 | |
HIV | A dataset which tested the ability to inhibit HIV replication | Molecules | 40 000 | |
HOPV | Harvard Organic Photovoltaic dataset utilized as p-type materials | Molecules | 350 | |
HPPB | Thermodynamic solubility datasets | | | |
KAGGLE | In-house compounds that were measured on 15 enzyme inhibition and ADME/TOX datasets | Molecules | 100 000 | |
KINASE | In-house compounds that were measured for IC50 of inhibition on 99 protein kinases | Molecules | 2 500 | |
LIPO | Experimental results of octanol/water distribution coefficient (logD at pH 7.4) | Molecules | 4 200 | |
Band Gap | Experimentally measured band gaps for inorganic crystal structures | Materials | 4 604 | |
Perovskite | Contains perovskite structures and their formation energies | Materials | 18 928 | |
MP Formation Energy | Contains calculated formation energies and inorganic crystal structures from the Materials Project database | Materials | 132 752 | |
MP Metallicity | Contains inorganic crystal structures from the Materials Project database labeled as metals or nonmetals | Materials | 106 113 | |
MUV | Benchmark dataset selected from PubChem BioAssay by applying a refined nearest neighbor analysis | Molecules | 90 000 | |
NCI | | | | |
PCBA | Database consisting of biological activities of small molecules generated by high-throughput screening | Molecules | 400 000 | |
PDBBIND | Experimental binding affinity data and structures of protein-ligand complexes | Molecules | “refined set” 4 852 - “general set” 12 800 - “core set” 193 | |
PPB | | | | |
QM7 | Subset of GDB-13 containing up to 7 heavy atoms CNOS | Molecules | 7 165 | |
QM8 | Dataset used in a study on modeling quantum mechanical calculations of electronic spectra and excited state energy of small molecules | Molecules | 20 000 | |
QM9 | Dataset that provides geometric/energetic/electronic and thermodynamic properties for a subset of GDB-17 database | Molecules | 134 000 | |
SAMPL | Similar to the FreeSolv dataset, which provides experimental and calculated hydration free energy of small molecules in water | | | |
SIDER | The Side Effect Resource (SIDER) is a database of marketed drugs and adverse drug reactions (ADR) | Molecules | 1 427 | |
Thermosol | Thermodynamic solubility datasets | | | |
Tox21 | The “Toxicology in the 21st Century” (Tox21) initiative created a public database measuring the toxicity of compounds | Molecules | 8 000 | |
Toxcast | Toxicology data for an extensive library of compounds based on in vitro high-throughput screening | Molecules | 8 000 | |
USPTO | Subsets of USPTO dataset of organic chemical reactions extracted from US patents and patent applications | Chemical reactions SMILES | MIT 479 000 - STEREO 1 M - 50K 50 000 | |
UV | The UV dataset tests Merck’s internal compounds on 190 absorption wavelengths between 210 and 400 nm | Molecules | 10 000 | |
ZINC15 | Purchasable compounds for virtual screening of small molecules to identify structures that are likely to bind to drug targets | Molecules | 250K - 1M - 10M | |
Platinum Adsorption | Different configurations of adsorbates (i.e., N and NO) on a platinum surface, represented as a lattice, and their formation energies | Adsorbate Configurations | 648 | |
Contributing a new dataset to MoleculeNet¶
If you are proposing a new dataset to be included in the MoleculeNet benchmarking suite, please follow the instructions below. Please review the datasets already available in MolNet before contributing.
Read the Contribution guidelines.
Open an issue to discuss the dataset you want to add to MolNet.
Write a DatasetLoader class that inherits from deepchem.molnet.load_function.molnet_loader._MolnetLoader and implements a create_dataset method. See the _QM9Loader for a simple example.
Write a load_dataset function that documents the dataset and add your load function to deepchem.molnet.__init__.py for easy importing.
Prepare your dataset as a .tar.gz or .zip file. Accepted filetypes include CSV, JSON, and SDF.
Ask a member of the technical steering committee to add your .tar.gz or .zip file to the DeepChem AWS bucket. Modify your load function to pull down the dataset from AWS.
Add documentation for your loader to the MoleculeNet docs.
Submit a [WIP] PR (Work in progress pull request) following the PR template.
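The packaging step above (a .tar.gz archive containing a CSV file) can be sketched with the Python standard library alone. The file name, column names, and values below are placeholders chosen for illustration, not part of any MoleculeNet dataset; uploading the archive to the DeepChem AWS bucket remains a manual step handled by the steering committee.

```python
import csv
import os
import tarfile
import tempfile

# Work in a temporary directory so the sketch leaves no files behind in
# the current project.
workdir = tempfile.mkdtemp()

# Hypothetical dataset file: one SMILES column and one label column.
csv_path = os.path.join(workdir, "my_dataset.csv")
with open(csv_path, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["smiles", "label"])
    writer.writerow(["CCO", 0.5])
    writer.writerow(["c1ccccc1", 1.2])

# Compress the CSV into a .tar.gz archive. arcname stores only the file
# name, so no local directory structure leaks into the archive.
archive_path = os.path.join(workdir, "my_dataset.tar.gz")
with tarfile.open(archive_path, "w:gz") as archive:
    archive.add(csv_path, arcname="my_dataset.csv")

# Read the archive back to confirm its contents.
with tarfile.open(archive_path, "r:gz") as archive:
    members = archive.getnames()
```

Listing the archive members before handing the file over is a cheap check that the archive holds exactly the files the load function will expect.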
Example Usage¶
Below is an example of how to load a MoleculeNet dataset and featurizer. This approach will work for any dataset in MoleculeNet by changing the load function and featurizer. For more details on the featurizers, see the Featurizers section.
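import deepchem as dc
from deepchem.feat.molecule_featurizers import MolGraphConvFeaturizer
# Featurize molecules as graphs, including edge (bond) features
featurizer = MolGraphConvFeaturizer(use_edges=True)
# Each MoleculeNet loader returns (tasks, datasets, transformers)
tasks, datasets, transformers = dc.molnet.load_qm9(featurizer=featurizer)
# The datasets tuple holds the training, validation, and test splits
train, valid, test = datasets
# Features, labels, weights, and identifiers of the training set
X, y, w, ids = train.X, train.y, train.w, train.ids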
Note that the “w” matrix represents the weight of each sample. Some assays may have missing values, in which case the weight is 0. Otherwise, the weight is 1.
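As a minimal sketch of how such weights are typically used, the weighted mean squared error below ignores entries whose weight is 0. It is written in plain Python for illustration and is not DeepChem's own loss code; the function name and the numbers are made up.

```python
def weighted_mse(y_true, y_pred, w):
    """Mean squared error over samples, scaled by per-sample weights.

    Samples with weight 0 (missing measurements) contribute nothing to
    either the numerator or the normalizing total weight.
    """
    total_weight = sum(w)
    if total_weight == 0:
        raise ValueError("all samples have zero weight")
    squared_error = sum(
        wi * (yt - yp) ** 2 for yt, yp, wi in zip(y_true, y_pred, w)
    )
    return squared_error / total_weight


# Hypothetical single-task data. The third measurement is missing, so its
# label is only a placeholder and its weight is 0.
y_true = [1.0, 2.0, 0.0]
y_pred = [1.5, 2.0, 9.0]
w = [1.0, 1.0, 0.0]

loss = weighted_mse(y_true, y_pred, w)
```

The large error on the third sample does not affect the result, because its weight removes it from the average.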
Additionally, the environment variable DEEPCHEM_DATA_DIR can be set, for example with os.environ['DEEPCHEM_DATA_DIR'] = 'path/to/store/featurized/dataset'. When the DEEPCHEM_DATA_DIR environment variable is set, the MolNet loader stores the featurized dataset in the specified directory. The next time the dataset is loaded, it is fetched from that directory directly rather than featurized again from the raw data.
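A short sketch of setting the variable from Python follows. The cache directory name is an arbitrary choice for this example; any writable directory should do. The loader call is left as a comment so that the snippet runs without DeepChem installed.

```python
import os
import tempfile

# Hypothetical cache location used only for this example.
cache_dir = os.path.join(tempfile.gettempdir(), "deepchem_data")
os.makedirs(cache_dir, exist_ok=True)

# Set the variable before calling any MoleculeNet loader.
os.environ["DEEPCHEM_DATA_DIR"] = cache_dir

# A later loader call (not executed here) would then use cache_dir, e.g.:
#   import deepchem as dc
#   tasks, datasets, transformers = dc.molnet.load_delaney()
```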
BACE Dataset¶
- load_bace_classification(featurizer: Union[deepchem.feat.base_classes.Featurizer, str] = 'ECFP', splitter: Optional[Union[deepchem.splits.splitters.Splitter, str]] = 'scaffold', transformers: List[Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = ['balancing'], reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, **kwargs) -> Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]]
Load BACE dataset, classification labels
BACE dataset with classification labels (“class”).
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
- load_bace_regression(featurizer: Union[deepchem.feat.base_classes.Featurizer, str] = 'ECFP', splitter: Optional[Union[deepchem.splits.splitters.Splitter, str]] = 'scaffold', transformers: List[Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = ['normalization'], reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, **kwargs) -> Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]]
Load BACE dataset, regression labels
The BACE dataset provides quantitative IC50 and qualitative (binary label) binding results for a set of inhibitors of human beta-secretase 1 (BACE-1).
All data are experimental values reported in scientific literature over the past decade, some with detailed crystal structures available. A collection of 1522 compounds is provided, along with the regression labels of IC50.
Scaffold splitting is recommended for this dataset.
The raw data csv file contains columns below:
“mol” - SMILES representation of the molecular structure
“pIC50” - Negative log of the IC50 binding affinity
“class” - Binary labels for inhibitor
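For readers unfamiliar with the “pIC50” column, the conversion can be sketched as below. The sketch assumes the usual convention (base-10 logarithm of the IC50 expressed in mol/L), which the description above does not spell out; the function name and the example concentrations are illustrative only.

```python
import math


def ic50_nm_to_pic50(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    ic50_molar = ic50_nm * 1e-9  # 1 nM = 1e-9 mol/L
    return -math.log10(ic50_molar)


# Hypothetical inhibitors: a potent one (10 nM) and a weak one (10 uM).
pic50_potent = ic50_nm_to_pic50(10.0)
pic50_weak = ic50_nm_to_pic50(10000.0)
```

Under this convention a larger pIC50 means a more potent inhibitor: 10 nM corresponds to a pIC50 of about 8, and 10 µM to about 5.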
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
References
- 1
Subramanian, Govindan, et al. “Computational modeling of β-secretase 1 (BACE-1) inhibitors using ligand based approaches.” Journal of chemical information and modeling 56.10 (2016): 1936-1949.
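The idea behind the scaffold splitting recommended above can be sketched in plain Python. The sketch assumes that a scaffold label has already been computed for every molecule (in practice this is usually a Bemis-Murcko scaffold obtained with a cheminformatics toolkit). The function is a simplified illustration of a group-wise split, not DeepChem's splitter, and the labels are invented.

```python
from collections import defaultdict


def group_split(group_keys, frac_train=0.8, frac_valid=0.1):
    """Split sample indices so that no group spans two subsets.

    Groups are visited from largest to smallest and placed in the
    training set while it has room, then validation, then test.
    """
    groups = defaultdict(list)
    for index, key in enumerate(group_keys):
        groups[key].append(index)

    # Largest groups first; sorted() is stable, so ties keep input order.
    ordered = sorted(groups.values(), key=len, reverse=True)

    n_samples = len(group_keys)
    train_cutoff = frac_train * n_samples
    valid_cutoff = (frac_train + frac_valid) * n_samples

    train, valid, test = [], [], []
    for members in ordered:
        if len(train) + len(members) <= train_cutoff:
            train.extend(members)
        elif len(train) + len(valid) + len(members) <= valid_cutoff:
            valid.extend(members)
        else:
            test.extend(members)
    return train, valid, test


# Hypothetical scaffold labels for ten molecules.
scaffolds = ["A", "A", "A", "A", "B", "B", "B", "C", "D", "A"]
train, valid, test = group_split(scaffolds)
```

Because whole scaffold groups are assigned together, the validation and test molecules are structurally different from the training molecules, which gives a harder and often more realistic estimate of generalization than a random split.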
BBBC Datasets¶
- load_bbbc001(splitter: Optional[Union[deepchem.splits.splitters.Splitter, str]] = 'index', transformers: List[Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = [], reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, **kwargs) -> Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]]
Load BBBC001 dataset
This dataset contains 6 images of human HT29 colon cancer cells. The task is to learn to predict the cell counts in these images. This dataset is too small to train algorithms on, but might serve as a good test dataset. https://data.broadinstitute.org/bbbc/BBBC001/
- Parameters
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
- load_bbbc002(splitter: Optional[Union[deepchem.splits.splitters.Splitter, str]] = 'index', transformers: List[Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = [], reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, **kwargs) -> Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]]
Load BBBC002 dataset
This dataset contains data corresponding to 5 samples of Drosophila Kc167 cells. There are 10 fields of view for each sample, each an image of size 512x512. Ground truth labels contain cell counts for this dataset. Full details about this dataset are available at https://data.broadinstitute.org/bbbc/BBBC002/.
- Parameters
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
BBBP Datasets¶
BBBP stands for Blood-Brain-Barrier Penetration
- load_bbbp(featurizer: Union[deepchem.feat.base_classes.Featurizer, str] = 'ECFP', splitter: Optional[Union[deepchem.splits.splitters.Splitter, str]] = 'scaffold', transformers: List[Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = ['balancing'], reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, **kwargs) -> Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]]
Load BBBP dataset
The blood-brain barrier penetration (BBBP) dataset is designed for the modeling and prediction of barrier permeability. As a membrane separating circulating blood and brain extracellular fluid, the blood-brain barrier blocks most drugs, hormones, and neurotransmitters. Penetration of the barrier is therefore a long-standing issue in the development of drugs targeting the central nervous system.
This dataset includes binary labels for over 2000 compounds on their permeability properties.
Scaffold splitting is recommended for this dataset.
The raw data csv file contains columns below:
“name” - Name of the compound
“smiles” - SMILES representation of the molecular structure
“p_np” - Binary labels for penetration/non-penetration
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
References
- 1
Martins, Ines Filipa, et al. “A Bayesian approach to in silico blood-brain barrier penetration modeling.” Journal of chemical information and modeling 52.6 (2012): 1686-1697.
Cell Counting Datasets¶
- load_cell_counting(splitter: Optional[Union[deepchem.splits.splitters.Splitter, str]] = None, transformers: List[Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = [], reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, **kwargs) -> Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]]
Load Cell Counting dataset.
Loads the cell counting dataset from http://www.robots.ox.ac.uk/~vgg/research/counting/index_org.html.
- Parameters
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
Chembl Datasets¶
- load_chembl(featurizer: Union[deepchem.feat.base_classes.Featurizer, str] = 'ECFP', splitter: Optional[Union[deepchem.splits.splitters.Splitter, str]] = 'scaffold', transformers: List[Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = ['normalization'], set: str = '5thresh', reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, **kwargs) -> Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]]
Load the ChEMBL dataset.
This dataset is based on release 22.1 of the data from https://www.ebi.ac.uk/chembl/. Two subsets of the data are available, depending on the “set” argument. “sparse” is a large dataset with 244,245 compounds. As the name suggests, the data is extremely sparse, with most compounds having activity data for only one target. “5thresh” is a much smaller set (23,871 compounds) that includes only compounds with activity data for at least five targets.
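The difference between the two subsets comes down to how many targets have a measured activity for each compound. The sketch below illustrates that thresholding criterion on a tiny invented activity table. It is not the procedure used to build the actual ChEMBL subsets, and all names and values are hypothetical.

```python
def count_measured(activities):
    """Number of targets with a recorded activity (None marks a gap)."""
    return sum(value is not None for value in activities)


# Hypothetical activity table: compound name -> one entry per target.
activity_table = {
    "cmpd_1": [6.2, None, None, None, None, None],
    "cmpd_2": [5.1, 7.3, 6.8, 4.9, 8.0, None],
    "cmpd_3": [None, 6.6, None, None, 5.5, None],
}

# Keep only compounds measured against at least five targets.
min_targets = 5
dense_subset = {
    name: activities
    for name, activities in activity_table.items()
    if count_measured(activities) >= min_targets
}
```

Here only the second compound survives the filter, which mirrors why a thresholded subset is much smaller but far less sparse than the full table.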
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
set (str) – the subset to load, either “sparse” or “5thresh”
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
Chembl25 Datasets¶
- load_chembl25(featurizer: Union[deepchem.feat.base_classes.Featurizer, str] = 'ECFP', splitter: Optional[Union[deepchem.splits.splitters.Splitter, str]] = 'scaffold', transformers: List[Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = ['normalization'], reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, **kwargs) -> Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]]
Loads the ChEMBL25 dataset, featurizes it, and does a split.
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
Clearance Datasets¶
- load_clearance(featurizer: Union[deepchem.feat.base_classes.Featurizer, str] = 'ECFP', splitter: Optional[Union[deepchem.splits.splitters.Splitter, str]] = 'scaffold', transformers: List[Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = ['log'], reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, **kwargs) -> Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]]
Load clearance datasets.
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
Clintox Datasets¶
- load_clintox(featurizer: Union[deepchem.feat.base_classes.Featurizer, str] = 'ECFP', splitter: Optional[Union[deepchem.splits.splitters.Splitter, str]] = 'scaffold', transformers: List[Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = ['balancing'], reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, **kwargs) -> Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]]
Load ClinTox dataset
The ClinTox dataset compares drugs approved by the FDA and drugs that have failed clinical trials for toxicity reasons. The dataset includes two classification tasks for 1491 drug compounds with known chemical structures:
clinical trial toxicity (or absence of toxicity)
FDA approval status.
The list of FDA-approved drugs is compiled from the SWEETLEAD database, and the list of drugs that failed clinical trials for toxicity reasons is compiled from the Aggregate Analysis of ClinicalTrials.gov (AACT) database.
Random splitting is recommended for this dataset.
The raw data csv file contains columns below:
“smiles” - SMILES representation of the molecular structure
“FDA_APPROVED” - FDA approval status
“CT_TOX” - Clinical trial results
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
References
- 1
Gayvert, Kaitlyn M., Neel S. Madhukar, and Olivier Elemento. “A data-driven approach to predicting successes and failures of clinical trials.” Cell chemical biology 23.10 (2016): 1294-1301.
- 2
Artemov, Artem V., et al. “Integrated deep learned transcriptomic and structure-based predictor of clinical trials outcomes.” bioRxiv (2016): 095653.
- 3
Novick, Paul A., et al. “SWEETLEAD: an in silico database of approved drugs, regulated chemicals, and herbal isolates for computer-aided drug discovery.” PloS one 8.11 (2013): e79568.
- 4
Aggregate Analysis of ClinicalTrials.gov (AACT) Database. https://www.ctti-clinicaltrials.org/aact-database
Delaney Datasets¶
- load_delaney(featurizer: Union[deepchem.feat.base_classes.Featurizer, str] = 'ECFP', splitter: Optional[Union[deepchem.splits.splitters.Splitter, str]] = 'scaffold', transformers: List[Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = ['normalization'], reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, **kwargs) -> Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]]
Load Delaney dataset
The Delaney (ESOL) dataset is a regression dataset containing structures and water solubility data for 1128 compounds. The dataset is widely used to validate machine learning models on estimating solubility directly from molecular structures (as encoded in SMILES strings).
Scaffold splitting is recommended for this dataset.
The raw data csv file contains columns below:
“Compound ID” - Name of the compound
“smiles” - SMILES representation of the molecular structure
“measured log solubility in mols per litre” - Log-scale water solubility of the compound, used as label
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
References
- 1
Delaney, John S. “ESOL: estimating aqueous solubility directly from molecular structure.” Journal of chemical information and computer sciences 44.3 (2004): 1000-1005.
Factors Datasets¶
- load_factors(shard_size=2000, featurizer=None, split=None, reload=True)
Loads FACTOR dataset; does not do train/test split
The Factors dataset is an in-house dataset from Merck that was first introduced in the following paper: Ramsundar, Bharath, et al. “Is multitask deep learning practical for pharma?.” Journal of chemical information and modeling 57.8 (2017): 2068-2076.
It contains 1500 Merck in-house compounds that were measured for IC50 of inhibition on 12 serine proteases. Unlike most of the other datasets featured in MoleculeNet, the Factors collection does not have structures for the compounds tested since they were proprietary Merck compounds. However, the collection does feature pre-computed descriptors for these compounds.
Note that the original train/valid/test split from the source data was preserved here, so this function doesn’t allow for alternate modes of splitting. Similarly, since the source data came pre-featurized, it is not possible to apply alternative featurizations.
- Parameters
shard_size (int, optional) – Size of the DiskDataset shards to write on disk
featurizer (optional) – Ignored since featurization pre-computed
split (optional) – Ignored since split pre-computed
reload (bool, optional) – Whether to automatically re-load from disk
Freesolv Dataset¶
- load_freesolv(featurizer: typing.Union[deepchem.feat.base_classes.Featurizer, str] = MATFeaturizer[], splitter: typing.Optional[typing.Union[deepchem.splits.splitters.Splitter, str]] = 'random', transformers: typing.List[typing.Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = ['normalization'], reload: bool = True, data_dir: typing.Optional[str] = None, save_dir: typing.Optional[str] = None, **kwargs) -> Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]]
Load Freesolv dataset
The FreeSolv dataset is a collection of experimental and calculated hydration free energies for small molecules in water. Here, we are using a modified version of the dataset that contains the SMILES string of each molecule and the corresponding experimental hydration free energy.
Random splitting is recommended for this dataset.
The raw data csv file contains columns below:
“mol” - SMILES representation of the molecular structure
“y” - Experimental hydration free energy
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
References
- 1
Łukasz Maziarka, et al. “Molecule Attention Transformer.” NeurIPS 2019 arXiv:2002.08264v1 [cs.LG].
- 2
Mobley DL, Guthrie JP. FreeSolv: a database of experimental and calculated hydration free energies, with input files. J Comput Aided Mol Des. 2014;28(7):711-720. doi:10.1007/s10822-014-9747-x
HIV Datasets¶
- load_hiv(featurizer: Union[deepchem.feat.base_classes.Featurizer, str] = 'ECFP', splitter: Optional[Union[deepchem.splits.splitters.Splitter, str]] = 'scaffold', transformers: List[Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = ['balancing'], reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, **kwargs) -> Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]]
Load HIV dataset
The HIV dataset was introduced by the Developmental Therapeutics Program (DTP) AIDS Antiviral Screen, which tested the ability of over 40,000 compounds to inhibit HIV replication. Screening results were evaluated and placed into three categories: confirmed inactive (CI), confirmed active (CA), and confirmed moderately active (CM). The latter two labels are combined, making this a binary classification task between inactive (CI) and active (CA and CM).
Scaffold splitting is recommended for this dataset.
The raw data csv file contains columns below:
“smiles”: SMILES representation of the molecular structure
“activity”: Three-class labels for screening results: CI/CM/CA
“HIV_active”: Binary labels for screening results: 1 (CA/CM) and 0 (CI)
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
References
- 1
AIDS Antiviral Screen Data. https://wiki.nci.nih.gov/display/NCIDTPdata/AIDS+Antiviral+Screen+Data
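The label merging described for the HIV dataset can be sketched in plain Python. This is only an illustration of the mapping from the three-class "activity" column to the binary "HIV_active" column; the helper name `binarize_activity` and the sample rows are hypothetical and are not part of DeepChem.

```python
# Mapping from the three screening categories to the binary label:
# CI (confirmed inactive) -> 0, CM and CA (moderately active / active) -> 1.
ACTIVITY_TO_BINARY = {"CI": 0, "CM": 1, "CA": 1}


def binarize_activity(labels):
    """Map three-class screening results to binary HIV_active labels."""
    return [ACTIVITY_TO_BINARY[label] for label in labels]


# Hypothetical rows mirroring the "activity" column of the raw csv file.
activity = ["CI", "CA", "CM", "CI"]
hiv_active = binarize_activity(activity)
print(hiv_active)
```

An unknown category raises a `KeyError`, which is usually preferable to silently assigning a label.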
HOPV Datasets¶
HOPV stands for the Harvard Organic Photovoltaic Dataset.
- load_hopv(featurizer: Union[deepchem.feat.base_classes.Featurizer, str] = 'ECFP', splitter: Optional[Union[deepchem.splits.splitters.Splitter, str]] = 'scaffold', transformers: List[Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = ['normalization'], reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, **kwargs) Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]] [source]¶
Load HOPV datasets. Does not do train/test split
The HOPV datasets consist of the “Harvard Organic Photovoltaic Dataset”. This dataset includes 350 small molecules and polymers that were used as p-type materials in organic photovoltaics (OPVs). Experimental properties include: HOMO [a.u.], LUMO [a.u.], Electrochemical gap [a.u.], Optical gap [a.u.], Power conversion efficiency [%], Open circuit potential [V], Short circuit current density [mA/cm^2], and fill factor [%]. Theoretical calculations in the original dataset have been removed (for now).
Lopez, Steven A., et al. “The Harvard organic photovoltaic dataset.” Scientific data 3.1 (2016): 1-7.
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
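The HOPV tasks are reported in different units, which is one reason the loader defaults to the 'normalization' transformer. The sketch below shows the general idea of standardizing one task's labels to zero mean and unit variance using training-set statistics, and undoing it afterwards. It assumes that this is what the transformer does in spirit; the function names and the sample values are hypothetical, and this is not DeepChem's implementation.

```python
from statistics import mean, pstdev


def fit_normalizer(train_values):
    """Compute the mean and standard deviation of the training labels."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    if sigma == 0:
        raise ValueError("labels are constant; cannot normalize")
    return mu, sigma


def normalize(values, mu, sigma):
    """Shift and scale labels to zero mean and unit variance."""
    return [(v - mu) / sigma for v in values]


def denormalize(values, mu, sigma):
    """Map normalized predictions back to the original units."""
    return [v * sigma + mu for v in values]


# Made-up power conversion efficiencies (in %) for illustration only.
train_pce = [2.0, 4.0, 6.0, 8.0]
mu, sigma = fit_normalizer(train_pce)
scaled = normalize(train_pce, mu, sigma)
restored = denormalize(scaled, mu, sigma)
```

Predictions from a model trained on normalized labels need the inverse step before they can be compared with measured values.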
HPPB Datasets¶
- load_hppb(featurizer: Union[deepchem.feat.base_classes.Featurizer, str] = 'ECFP', splitter: Optional[Union[deepchem.splits.splitters.Splitter, str]] = 'scaffold', transformers: List[Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = ['log'], reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, **kwargs) Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]] [source]¶
Loads the thermodynamic solubility datasets.
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
KAGGLE Datasets¶
- load_kaggle(shard_size=2000, featurizer=None, split=None, reload=True)[source]¶
Loads kaggle datasets. Generates if not stored already.
The Kaggle dataset is an in-house dataset from Merck that was first introduced in the following paper:
Ma, Junshui, et al. “Deep neural nets as a method for quantitative structure–activity relationships.” Journal of chemical information and modeling 55.2 (2015): 263-274.
It contains 100,000 unique Merck in-house compounds that were measured on 15 enzyme inhibition and ADME/TOX datasets. Unlike most of the other datasets featured in MoleculeNet, the Kaggle collection does not have structures for the compounds tested since they were proprietary Merck compounds. However, the collection does feature pre-computed descriptors for these compounds.
Note that the original train/valid/test split from the source data was preserved here, so this function doesn’t allow for alternate modes of splitting. Similarly, since the source data came pre-featurized, it is not possible to apply alternative featurizations.
- Parameters
shard_size (int, optional) – Size of the DiskDataset shards to write on disk
featurizer (optional) – Ignored since featurization pre-computed
split (optional) – Ignored since split pre-computed
reload (bool, optional) – Whether to automatically re-load from disk
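To make the shard_size parameter concrete, the following sketch splits a sequence of rows into consecutive chunks of at most shard_size rows. It only illustrates the chunking idea; `iter_shards` is a hypothetical helper, and how DeepChem actually writes DiskDataset shards to disk is not shown here.

```python
def iter_shards(rows, shard_size=2000):
    """Yield consecutive slices containing at most shard_size rows each."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    for start in range(0, len(rows), shard_size):
        yield rows[start:start + shard_size]


# 4500 placeholder rows split with the loader's default shard size.
rows = list(range(4500))
shard_lengths = [len(shard) for shard in iter_shards(rows, shard_size=2000)]
print(shard_lengths)
```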
Kinase Datasets¶
- load_kinase(shard_size=2000, featurizer=None, split=None, reload=True)[source]¶
Loads Kinase datasets, does not do train/test split
The Kinase dataset is an in-house dataset from Merck that was first introduced in the following paper: Ramsundar, Bharath, et al. “Is multitask deep learning practical for pharma?.” Journal of chemical information and modeling 57.8 (2017): 2068-2076.
It contains 2500 Merck in-house compounds that were measured for IC50 of inhibition on 99 protein kinases. Unlike most of the other datasets featured in MoleculeNet, the Kinase collection does not have structures for the compounds tested since they were proprietary Merck compounds. However, the collection does feature pre-computed descriptors for these compounds.
Note that the original train/valid/test split from the source data was preserved here, so this function doesn’t allow for alternate modes of splitting. Similarly, since the source data came pre-featurized, it is not possible to apply alternative featurizations.
- Parameters
shard_size (int, optional) – Size of the DiskDataset shards to write on disk
featurizer (optional) – Ignored since featurization pre-computed
split (optional) – Ignored since split pre-computed
reload (bool, optional) – Whether to automatically re-load from disk
Lipo Datasets¶
- load_lipo(featurizer: Union[deepchem.feat.base_classes.Featurizer, str] = 'ECFP', splitter: Optional[Union[deepchem.splits.splitters.Splitter, str]] = 'scaffold', transformers: List[Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = ['normalization'], reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, **kwargs) Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]] [source]¶
Load Lipophilicity dataset
Lipophilicity is an important feature of drug molecules that affects both membrane permeability and solubility. The lipophilicity dataset, curated from ChEMBL database, provides experimental results of octanol/water distribution coefficient (logD at pH 7.4) of 4200 compounds.
Scaffold splitting is recommended for this dataset.
The raw data csv file contains columns below:
“smiles” - SMILES representation of the molecular structure
“exp” - Measured octanol/water distribution coefficient (logD) of the compound, used as label
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
References
- 1
Hersey, A. ChEMBL Deposited Data Set - AZ dataset; 2015. https://doi.org/10.6019/chembl3301361
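A minimal usage sketch for the loader, using only the call and return structure documented above. It assumes DeepChem is installed and that the raw data can be downloaded on first use; `summarize_lipo` is a hypothetical helper name, and the import is placed inside the function so that the sketch can be defined without DeepChem present.

```python
def summarize_lipo(featurizer="ECFP", splitter="scaffold"):
    """Load the Lipophilicity dataset and report the size of each split."""
    # Imported lazily so that defining this function does not require DeepChem.
    import deepchem as dc

    tasks, datasets, transformers = dc.molnet.load_lipo(
        featurizer=featurizer, splitter=splitter
    )
    train, valid, test = datasets
    return {
        "tasks": tasks,
        "n_train": len(train),
        "n_valid": len(valid),
        "n_test": len(test),
    }
```

Calling `summarize_lipo()` triggers the download and featurization the first time; with reload left at its default of True, later calls reuse the cached datasets.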
Materials Datasets¶
Materials datasets include inorganic crystal structures, chemical compositions, and target properties like formation energies and band gaps. Machine learning problems in materials science commonly include predicting the value of a continuous (regression) or categorical (classification) property of a material based on its chemical composition or crystal structure. “Inverse design” is also of great interest, in which ML methods generate crystal structures that have a desired property. Other areas where ML is applicable in materials include discovering new or modified phenomenological models that describe material behavior.
- load_bandgap(featurizer: typing.Union[deepchem.feat.base_classes.Featurizer, str] = ElementPropertyFingerprint[data_source='matminer'], splitter: typing.Optional[typing.Union[deepchem.splits.splitters.Splitter, str]] = 'random', transformers: typing.List[typing.Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = ['normalization'], reload: bool = True, data_dir: typing.Optional[str] = None, save_dir: typing.Optional[str] = None, **kwargs) Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]] [source]¶
Load band gap dataset.
Contains 4604 experimentally measured band gaps for inorganic crystal structure compositions. In benchmark studies, random forest models achieved a mean absolute error of 0.45 eV during five-fold nested cross validation on this dataset.
For more details on the dataset see [1]. For more details on previous benchmarks for this dataset, see [2].
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
- Returns
tasks, datasets, transformers –
- tasks: list
Column names corresponding to machine learning target variables.
- datasets: tuple
train, validation, test splits of data as deepchem.data.datasets.Dataset instances.
- transformers: list
deepchem.trans.transformers.Transformer instances applied to dataset.
- Return type
tuple
References
- 1
Zhuo, Y. et al. “Predicting the Band Gaps of Inorganic Solids by Machine Learning.” J. Phys. Chem. Lett. (2018) DOI: 10.1021/acs.jpclett.8b00124.
- 2
Dunn, A. et al. “Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference Algorithm.” https://arxiv.org/abs/2005.00707 (2020)
Examples
>>> import deepchem as dc
>>> tasks, datasets, transformers = dc.molnet.load_bandgap()
>>> train_dataset, val_dataset, test_dataset = datasets
>>> n_tasks = len(tasks)
>>> n_features = train_dataset.get_data_shape()[0]
>>> model = dc.models.MultitaskRegressor(n_tasks, n_features)
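The benchmark figure quoted for the band gap dataset is an error in eV. Assuming the metric is the usual mean absolute error (MAE), it can be computed as in the sketch below. The function name and the sample band gaps are hypothetical and are not taken from the dataset.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |measured - predicted| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up band gaps in eV, for illustration only.
measured = [1.0, 2.5, 0.0, 3.2]
predicted = [1.5, 2.0, 0.5, 3.2]
mae = mean_absolute_error(measured, predicted)
print(mae)
```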
- load_perovskite(featurizer: typing.Union[deepchem.feat.base_classes.Featurizer, str] = DummyFeaturizer[], splitter: typing.Optional[typing.Union[deepchem.splits.splitters.Splitter, str]] = 'random', transformers: typing.List[typing.Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = ['normalization'], reload: bool = True, data_dir: typing.Optional[str] = None, save_dir: typing.Optional[str] = None, **kwargs) Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]] [source]¶
Load perovskite dataset.
Contains 18928 perovskite structures and their formation energies. In benchmark studies, random forest models and crystal graph neural networks achieved mean absolute errors of 0.23 and 0.05 eV/atom, respectively, during five-fold nested cross validation on this dataset.
For more details on the dataset see [1]. For more details on previous benchmarks for this dataset, see [2].
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
- Returns
tasks, datasets, transformers –
- tasks: list
Column names corresponding to machine learning target variables.
- datasets: tuple
train, validation, test splits of data as deepchem.data.datasets.Dataset instances.
- transformers: list
deepchem.trans.transformers.Transformer instances applied to dataset.
- Return type
tuple
References
- 1
Castelli, I. et al. “New cubic perovskites for one- and two-photon water splitting using the computational materials repository.” Energy Environ. Sci., (2012), 5, 9034-9043 DOI: 10.1039/C2EE22341D.
- 2
Dunn, A. et al. “Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference Algorithm.” https://arxiv.org/abs/2005.00707 (2020)
Examples
>>> import deepchem as dc
>>> tasks, datasets, transformers = dc.molnet.load_perovskite()
>>> train_dataset, val_dataset, test_dataset = datasets
>>> model = dc.models.CGCNNModel(mode='regression', batch_size=32, learning_rate=0.001)
- load_mp_formation_energy(featurizer: typing.Union[deepchem.feat.base_classes.Featurizer, str] = SineCoulombMatrix[max_atoms=100, flatten=True], splitter: typing.Optional[typing.Union[deepchem.splits.splitters.Splitter, str]] = 'random', transformers: typing.List[typing.Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = ['normalization'], reload: bool = True, data_dir: typing.Optional[str] = None, save_dir: typing.Optional[str] = None, **kwargs) Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]] [source]¶
Load mp formation energy dataset.
Contains 132752 calculated formation energies and inorganic crystal structures from the Materials Project database. In benchmark studies, random forest models achieved a mean absolute error of 0.116 eV/atom during five-fold nested cross validation on this dataset.
For more details on the dataset see [1]. For more details on previous benchmarks for this dataset, see [2].
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
- Returns
tasks, datasets, transformers –
- tasks: list
Column names corresponding to machine learning target variables.
- datasets: tuple
train, validation, test splits of data as deepchem.data.datasets.Dataset instances.
- transformers: list
deepchem.trans.transformers.Transformer instances applied to dataset.
- Return type
tuple
References
- 1
A. Jain*, S.P. Ong*, et al. (*=equal contributions) The Materials Project: A materials genome approach to accelerating materials innovation APL Materials, 2013, 1(1), 011002. doi:10.1063/1.4812323 (2013).
- 2
Dunn, A. et al. “Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference Algorithm.” https://arxiv.org/abs/2005.00707 (2020)
Examples
>>> import deepchem as dc
>>> tasks, datasets, transformers = dc.molnet.load_mp_formation_energy()
>>> train_dataset, val_dataset, test_dataset = datasets
>>> n_tasks = len(tasks)
>>> n_features = train_dataset.get_data_shape()[0]
>>> model = dc.models.MultitaskRegressor(n_tasks, n_features)
- load_mp_metallicity(featurizer: typing.Union[deepchem.feat.base_classes.Featurizer, str] = SineCoulombMatrix[max_atoms=100, flatten=True], splitter: typing.Optional[typing.Union[deepchem.splits.splitters.Splitter, str]] = 'random', transformers: typing.List[typing.Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = ['balancing'], reload: bool = True, data_dir: typing.Optional[str] = None, save_dir: typing.Optional[str] = None, **kwargs) Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]] [source]¶
Load mp metallicity dataset.
Contains 106113 inorganic crystal structures from the Materials Project database labeled as metals or nonmetals. In benchmark studies, random forest models achieved a mean ROC-AUC of 0.9 during five-fold nested cross validation on this dataset.
For more details on the dataset see [1]. For more details on previous benchmarks for this dataset, see [2].
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
- Returns
tasks, datasets, transformers –
- tasks: list
Column names corresponding to machine learning target variables.
- datasets: tuple
train, validation, test splits of data as deepchem.data.datasets.Dataset instances.
- transformers: list
deepchem.trans.transformers.Transformer instances applied to dataset.
- Return type
tuple
References
- 1
A. Jain*, S.P. Ong*, et al. (*=equal contributions) The Materials Project: A materials genome approach to accelerating materials innovation APL Materials, 2013, 1(1), 011002. doi:10.1063/1.4812323 (2013).
- 2
Dunn, A. et al. “Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference Algorithm.” https://arxiv.org/abs/2005.00707 (2020)
Examples
>>> import deepchem as dc
>>> tasks, datasets, transformers = dc.molnet.load_mp_metallicity()
>>> train_dataset, val_dataset, test_dataset = datasets
>>> n_tasks = len(tasks)
>>> n_features = train_dataset.get_data_shape()[0]
>>> model = dc.models.MultitaskClassifier(n_tasks, n_features)
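The metallicity benchmark is reported as ROC-AUC. One way to read that number is as the probability that a randomly chosen metal receives a higher score than a randomly chosen nonmetal. The sketch below computes it directly from that definition, counting ties as half; the function name, labels, and scores are hypothetical, and the pairwise loop is meant for illustration rather than for large datasets.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = metal, 0 = nonmetal) and model scores.
is_metal = [1, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.5, 0.1, 0.8]
auc = roc_auc(is_metal, scores)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.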
MUV Datasets¶
- load_muv(featurizer: Union[deepchem.feat.base_classes.Featurizer, str] = 'ECFP', splitter: Optional[Union[deepchem.splits.splitters.Splitter, str]] = 'scaffold', transformers: List[Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = ['balancing'], reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, **kwargs) Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]] [source]¶
Load MUV dataset
The Maximum Unbiased Validation (MUV) group is a benchmark dataset selected from PubChem BioAssay by applying a refined nearest neighbor analysis.
The MUV dataset contains 17 challenging tasks for around 90 thousand compounds and is specifically designed for validation of virtual screening techniques.
Scaffold splitting is recommended for this dataset.
The raw data csv file contains columns below:
“mol_id” - PubChem CID of the compound
“smiles” - SMILES representation of the molecular structure
“MUV-XXX” - Measured results (Active/Inactive) for bioassays
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
References
- 1
Rohrer, Sebastian G., and Knut Baumann. “Maximum unbiased validation (MUV) data sets for virtual screening based on PubChem bioactivity data.” Journal of chemical information and modeling 49.2 (2009): 169-184.
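MUV tasks contain far fewer actives than inactives, which is why the loader defaults to the 'balancing' transformer. The sketch below illustrates the general idea of reweighting samples so that both classes carry the same total weight for one task. It is an assumption-laden illustration: `balanced_weights` is a hypothetical helper, and the exact weighting scheme DeepChem applies may differ.

```python
def balanced_weights(labels):
    """Return per-sample weights so both classes have equal total weight."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one sample of each class")
    # Negatives keep weight 1.0; positives are scaled up by the class ratio.
    pos_weight = n_neg / n_pos
    return [pos_weight if y == 1 else 1.0 for y in labels]


# One active among nine inactives, for illustration only.
labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
weights = balanced_weights(labels)
total_pos = sum(w for y, w in zip(labels, weights) if y == 1)
total_neg = sum(w for y, w in zip(labels, weights) if y == 0)
```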
NCI Datasets¶
- load_nci(featurizer: Union[deepchem.feat.base_classes.Featurizer, str] = 'ECFP', splitter: Optional[Union[deepchem.splits.splitters.Splitter, str]] = 'random', transformers: List[Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = ['normalization'], reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, **kwargs) Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]] [source]¶
Load NCI dataset.
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
PCBA Datasets¶
- load_pcba(featurizer: Union[deepchem.feat.base_classes.Featurizer, str] = 'ECFP', splitter: Optional[Union[deepchem.splits.splitters.Splitter, str]] = 'scaffold', transformers: List[Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = ['balancing'], reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, **kwargs) Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]] [source]¶
Load PCBA dataset
PubChem BioAssay (PCBA) is a database consisting of biological activities of small molecules generated by high-throughput screening. We use a subset of PCBA, containing 128 bioassays measured over 400 thousand compounds, used by previous work to benchmark machine learning methods.
Random splitting is recommended for this dataset.
The raw data csv file contains columns below:
“mol_id” - PubChem CID of the compound
“smiles” - SMILES representation of the molecular structure
“PCBA-XXX” - Measured results (Active/Inactive) for bioassays: search for the assay ID at https://pubchem.ncbi.nlm.nih.gov/search/#collection=bioassays for details
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
References
- 1
Wang, Yanli, et al. “PubChem’s BioAssay database.” Nucleic acids research 40.D1 (2011): D400-D412.
PDBBIND Datasets¶
- load_pdbbind(featurizer: deepchem.feat.base_classes.ComplexFeaturizer, splitter: Optional[Union[deepchem.splits.splitters.Splitter, str]] = 'random', transformers: List[Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = ['normalization'], reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, pocket: bool = True, set_name: str = 'core', **kwargs) Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]] [source]¶
Load PDBBind dataset.
The PDBBind dataset includes experimental binding affinity data and structures for 4852 protein-ligand complexes from the “refined set” and 12800 complexes from the “general set” in PDBBind v2019 and 193 complexes from the “core set” in PDBBind v2013. The refined set removes data with obvious problems in 3D structure, binding data, or other aspects and should therefore be a better starting point for docking/scoring studies. Details on the criteria used to construct the refined set can be found in [4]. The general set does not include the refined set. The core set is a subset of the refined set that is not updated annually.
Random splitting is recommended for this dataset.
The raw dataset contains the columns below:
“ligand” - SDF of the molecular structure
“protein” - PDB of the protein structure
“-logKd/Ki” - Experimental binding affinity of the complex, used as label
- Parameters
featurizer (ComplexFeaturizer or str) – the complex featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
pocket (bool (default True)) – If true, use only the binding pocket for featurization.
set_name (str (default 'core')) – Name of dataset to download. ‘refined’, ‘general’, and ‘core’ are supported.
- Returns
tasks, datasets, transformers –
- tasks: list
Column names corresponding to machine learning target variables.
- datasets: tuple
train, validation, test splits of data as deepchem.data.datasets.Dataset instances.
- transformers: list
deepchem.trans.transformers.Transformer instances applied to dataset.
- Return type
tuple
References
- 1
Liu, Z.H. et al. Acc. Chem. Res. 2017, 50, 302-309. (PDBbind v.2016)
- 2
Liu, Z.H. et al. Bioinformatics, 2015, 31, 405-412. (PDBbind v.2014)
- 3
Li, Y. et al. J. Chem. Inf. Model., 2014, 54, 1700-1716. (PDBbind v.2013)
- 4
Cheng, T.J. et al. J. Chem. Inf. Model., 2009, 49, 1079-1093. (PDBbind v.2009)
- 5
Wang, R.X. et al. J. Med. Chem., 2005, 48, 4111-4119. (Original release)
- 6
Wang, R.X. et al. J. Med. Chem., 2004, 47, 2977-2980. (Original release)
PPB Datasets¶
- load_ppb(featurizer: Union[deepchem.feat.base_classes.Featurizer, str] = 'ECFP', splitter: Optional[Union[deepchem.splits.splitters.Splitter, str]] = 'scaffold', transformers: List[Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = ['normalization'], reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, **kwargs) Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]] [source]¶
Load PPB datasets.
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
QM7 Datasets¶
- load_qm7(featurizer: typing.Union[deepchem.feat.base_classes.Featurizer, str] = CoulombMatrix[max_atoms=23, remove_hydrogens=False, randomize=False, upper_tri=False, n_samples=1, seed=None], splitter: typing.Optional[typing.Union[deepchem.splits.splitters.Splitter, str]] = 'random', transformers: typing.List[typing.Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = ['normalization'], reload: bool = True, data_dir: typing.Optional[str] = None, save_dir: typing.Optional[str] = None, **kwargs) Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]] [source]¶
Load QM7 dataset
QM7 is a subset of GDB-13 (a database of nearly 1 billion stable and synthetically accessible organic molecules) containing up to 7 heavy atoms C, N, O, and S. The 3D Cartesian coordinates of the most stable conformations and their atomization energies were determined using ab-initio density functional theory (PBE0/tier2 basis set). This dataset also provided Coulomb matrices as calculated in [Rupp et al. PRL, 2012]:
Stratified splitting is recommended for this dataset.
The data file (.mat format, we recommend using scipy.io.loadmat for python users to load this original data) contains five arrays:
“X” - (7165 x 23 x 23), Coulomb matrices
“T” - (7165), atomization energies (unit: kcal/mol)
“P” - (5 x 1433), cross-validation splits as used in [Montavon et al. NIPS, 2012]
“Z” - (7165 x 23), atomic charges
“R” - (7165 x 23 x 3), Cartesian coordinates (unit: Bohr) of each atom in the molecules
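As a rough sketch of how the “P” array might be used for cross-validation: the snippet below assumes that each of the five rows of “P” lists the sample indices of one held-out fold, which is an assumption not stated in the description above. It uses a tiny synthetic stand-in for “P” rather than the real .mat file, and the helper name `fold_indices` is hypothetical.

```python
from typing import List, Tuple


def fold_indices(P: List[List[int]], k: int) -> Tuple[List[int], List[int]]:
    """Return (train, test) indices, holding out row k of P as the test fold."""
    test = list(P[k])
    # Every index from the remaining rows goes into the training set.
    train = [i for j, row in enumerate(P) if j != k for i in row]
    return train, test


# Synthetic stand-in for "P": 5 folds over 10 samples (the real array is 5 x 1433).
P = [[0, 5], [1, 6], [2, 7], [3, 8], [4, 9]]

train, test = fold_indices(P, k=0)
print(len(train), len(test))
```

With the real data, the same indices would be used to select rows of “X” and “T” for each fold.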
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
Note
DeepChem 2.4.0 has turned on sanitization for this dataset by default. For the QM7 dataset, this means that calling this function will return 6838 compounds instead of 7160 in the source dataset file. This appears to be due to valence specification mismatches in the dataset that weren’t caught in earlier more lax versions of RDKit. Note that this may subtly affect benchmarking results on this dataset.
References
- 1
Rupp, Matthias, et al. “Fast and accurate modeling of molecular atomization energies with machine learning.” Physical review letters 108.5 (2012): 058301.
- 2
Montavon, Grégoire, et al. “Learning invariant representations of molecules for atomization energy prediction.” Advances in Neural Information Processing Systems. 2012.
QM8 Datasets¶
- load_qm8(featurizer: typing.Union[deepchem.feat.base_classes.Featurizer, str] = CoulombMatrix[max_atoms=26, remove_hydrogens=False, randomize=False, upper_tri=False, n_samples=1, seed=None], splitter: typing.Optional[typing.Union[deepchem.splits.splitters.Splitter, str]] = 'random', transformers: typing.List[typing.Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = ['normalization'], reload: bool = True, data_dir: typing.Optional[str] = None, save_dir: typing.Optional[str] = None, **kwargs) Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]] [source]¶
Load QM8 dataset
QM8 is the dataset used in a study on modeling quantum mechanical calculations of electronic spectra and excited state energy of small molecules. Multiple methods, including time-dependent density functional theories (TDDFT) and second-order approximate coupled-cluster (CC2), are applied to a collection of molecules that include up to eight heavy atoms (also a subset of the GDB-17 database). In our collection, there are four excited state properties calculated by four different methods on 22 thousand samples:
S0 -> S1 transition energy E1 and the corresponding oscillator strength f1
S0 -> S2 transition energy E2 and the corresponding oscillator strength f2
E1, E2, f1, f2 are in atomic units. f1, f2 are in length representation
Random splitting is recommended for this dataset.
The source data contain:
qm8.sdf: molecular structures
qm8.sdf.csv: tables for molecular properties
Column 1: Molecule ID (gdb9 index) mapping to the .sdf file
Columns 2-5: RI-CC2/def2TZVP
Columns 6-9: LR-TDPBE0/def2SVP
Columns 10-13: LR-TDPBE0/def2TZVP
Columns 14-17: LR-TDCAM-B3LYP/def2TZVP
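The column layout above can be written down as a small lookup, which may help when slicing qm8.sdf.csv by method. This is only an illustrative sketch: the names `METHOD_COLUMNS` and `method_for_column` are hypothetical and are not part of DeepChem.

```python
# 1-based column ranges from the layout above, keyed by method name.
METHOD_COLUMNS = {
    "RI-CC2/def2TZVP": (2, 5),
    "LR-TDPBE0/def2SVP": (6, 9),
    "LR-TDPBE0/def2TZVP": (10, 13),
    "LR-TDCAM-B3LYP/def2TZVP": (14, 17),
}


def method_for_column(col: int) -> str:
    """Map a 1-based column number of qm8.sdf.csv to its description."""
    if col == 1:
        return "Molecule ID"
    for method, (first, last) in METHOD_COLUMNS.items():
        if first <= col <= last:
            return method
    raise ValueError(f"column {col} is outside the documented range 1-17")


print(method_for_column(7))
```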
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
Note
DeepChem 2.4.0 has turned on sanitization for this dataset by default. For the QM8 dataset, this means that calling this function will return 21747 compounds instead of 21786 in the source dataset file. This appears to be due to valence specification mismatches in the dataset that weren’t caught in earlier more lax versions of RDKit. Note that this may subtly affect benchmarking results on this dataset.
References
- 1
Blum, Lorenz C., and Jean-Louis Reymond. “970 million druglike small molecules for virtual screening in the chemical universe database GDB-13.” Journal of the American Chemical Society 131.25 (2009): 8732-8733.
- 2
Ramakrishnan, Raghunathan, et al. “Electronic spectra from TDDFT and machine learning in chemical space.” The Journal of chemical physics 143.8 (2015): 084111.
QM9 Datasets¶
- load_qm9(featurizer: typing.Union[deepchem.feat.base_classes.Featurizer, str] = CoulombMatrix[max_atoms=29, remove_hydrogens=False, randomize=False, upper_tri=False, n_samples=1, seed=None], splitter: typing.Optional[typing.Union[deepchem.splits.splitters.Splitter, str]] = 'random', transformers: typing.List[typing.Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = ['normalization'], reload: bool = True, data_dir: typing.Optional[str] = None, save_dir: typing.Optional[str] = None, **kwargs) Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]] [source]¶
Load QM9 dataset
QM9 is a comprehensive dataset that provides geometric, energetic, electronic and thermodynamic properties for a subset of GDB-17 database, comprising 134 thousand stable organic molecules with up to 9 heavy atoms. All molecules are modeled using density functional theory (B3LYP/6-31G(2df,p) based DFT).
Random splitting is recommended for this dataset.
The source data contain:
qm9.sdf: molecular structures
qm9.sdf.csv: tables for molecular properties
“mol_id” - Molecule ID (gdb9 index) mapping to the .sdf file
“A” - Rotational constant (unit: GHz)
“B” - Rotational constant (unit: GHz)
“C” - Rotational constant (unit: GHz)
“mu” - Dipole moment (unit: D)
“alpha” - Isotropic polarizability (unit: Bohr^3)
“homo” - Highest occupied molecular orbital energy (unit: Hartree)
“lumo” - Lowest unoccupied molecular orbital energy (unit: Hartree)
“gap” - Gap between HOMO and LUMO (unit: Hartree)
“r2” - Electronic spatial extent (unit: Bohr^2)
“zpve” - Zero point vibrational energy (unit: Hartree)
“u0” - Internal energy at 0K (unit: Hartree)
“u298” - Internal energy at 298.15K (unit: Hartree)
“h298” - Enthalpy at 298.15K (unit: Hartree)
“g298” - Free energy at 298.15K (unit: Hartree)
“cv” - Heat capacity at 298.15K (unit: cal/(mol*K))
“u0_atom” - Atomization energy at 0K (unit: kcal/mol)
“u298_atom” - Atomization energy at 298.15K (unit: kcal/mol)
“h298_atom” - Atomization enthalpy at 298.15K (unit: kcal/mol)
“g298_atom” - Atomization free energy at 298.15K (unit: kcal/mol)
“u0_atom” ~ “g298_atom” (used in MoleculeNet) are calculated from the differences between “u0” ~ “g298” and sum of reference energies of all atoms in the molecules, as given in https://figshare.com/articles/Atomref%3A_Reference_thermochemical_energies_of_H%2C_C%2C_N%2C_O%2C_F_atoms./1057643
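The relationship described above (total energy minus the summed per-atom reference energies) can be sketched as follows. The function name, the conversion constant and the numeric inputs are assumptions made for illustration; in particular, the reference energies below are placeholders and are not the values from the atomref table.

```python
from typing import Dict

# Approximate Hartree -> kcal/mol conversion factor (an assumption of this sketch).
HARTREE_TO_KCAL_PER_MOL = 627.5095


def atomization_energy(total_hartree: float,
                       atom_counts: Dict[str, int],
                       atom_ref_hartree: Dict[str, float]) -> float:
    """Total energy minus the summed per-atom reference energies, in kcal/mol."""
    reference = sum(atom_ref_hartree[element] * count
                    for element, count in atom_counts.items())
    return (total_hartree - reference) * HARTREE_TO_KCAL_PER_MOL


# Placeholder numbers for illustration only; they are NOT the atomref values.
atom_ref = {"H": -0.5, "C": -37.8}
u0_atom = atomization_energy(-40.5, {"C": 1, "H": 4}, atom_ref)
print(round(u0_atom, 2))
```

The same function would apply to “u298”, “h298” and “g298” given the matching reference energies for each quantity.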
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
Note
DeepChem 2.4.0 has turned on sanitization for this dataset by default. For the QM9 dataset, this means that calling this function will return 132480 compounds instead of 133885 in the source dataset file. This appears to be due to valence specification mismatches in the dataset that weren’t caught in earlier more lax versions of RDKit. Note that this may subtly affect benchmarking results on this dataset.
References
- 1
Blum, Lorenz C., and Jean-Louis Reymond. “970 million druglike small molecules for virtual screening in the chemical universe database GDB-13.” Journal of the American Chemical Society 131.25 (2009): 8732-8733.
- 2
Ramakrishnan, Raghunathan, et al. “Quantum chemistry structures and properties of 134 kilo molecules.” Scientific data 1 (2014): 140022.
SAMPL Datasets¶
- load_sampl(featurizer: Union[deepchem.feat.base_classes.Featurizer, str] = 'ECFP', splitter: Optional[Union[deepchem.splits.splitters.Splitter, str]] = 'scaffold', transformers: List[Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = ['normalization'], reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, **kwargs) Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]] [source]¶
Load SAMPL(FreeSolv) dataset
The Free Solvation Database, FreeSolv(SAMPL), provides experimental and calculated hydration free energy of small molecules in water. The calculated values are derived from alchemical free energy calculations using molecular dynamics simulations. The experimental values are included in the benchmark collection.
Random splitting is recommended for this dataset.
The raw data csv file contains columns below:
“iupac” - IUPAC name of the compound
“smiles” - SMILES representation of the molecular structure
“expt” - Measured solvation energy (unit: kcal/mol) of the compound, used as label
“calc” - Calculated solvation energy (unit: kcal/mol) of the compound
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
References
- 1
Mobley, David L., and J. Peter Guthrie. “FreeSolv: a database of experimental and calculated hydration free energies, with input files.” Journal of computer-aided molecular design 28.7 (2014): 711-720.
SIDER Datasets¶
- load_sider(featurizer: Union[deepchem.feat.base_classes.Featurizer, str] = 'ECFP', splitter: Optional[Union[deepchem.splits.splitters.Splitter, str]] = 'scaffold', transformers: List[Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = ['balancing'], reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, **kwargs) Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]] [source]¶
Load SIDER dataset
The Side Effect Resource (SIDER) is a database of marketed drugs and adverse drug reactions (ADR). The version of the SIDER dataset in DeepChem has grouped drug side effects into 27 system organ classes following MedDRA classifications measured for 1427 approved drugs.
Random splitting is recommended for this dataset.
The raw data csv file contains columns below:
“smiles”: SMILES representation of the molecular structure
“Hepatobiliary disorders” ~ “Injury, poisoning and procedural complications”: Recorded side effects for the drug. Please refer to http://sideeffects.embl.de/se/?page=98 for details on ADRs.
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
References
- 1
Kuhn, Michael, et al. “The SIDER database of drugs and side effects.” Nucleic acids research 44.D1 (2015): D1075-D1079.
- 2
Altae-Tran, Han, et al. “Low data drug discovery with one-shot learning.” ACS central science 3.4 (2017): 283-293.
- 3
Medical Dictionary for Regulatory Activities. http://www.meddra.org/
Thermosol Datasets¶
- load_thermosol(featurizer: Union[deepchem.feat.base_classes.Featurizer, str] = 'ECFP', splitter: Optional[Union[deepchem.splits.splitters.Splitter, str]] = 'scaffold', transformers: List[Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = [], reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, **kwargs) Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]] [source]¶
Loads the thermodynamic solubility datasets.
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
Tox21 Datasets¶
- load_tox21(featurizer: Union[deepchem.feat.base_classes.Featurizer, str] = 'ECFP', splitter: Optional[Union[deepchem.splits.splitters.Splitter, str]] = 'scaffold', transformers: List[Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = ['balancing'], reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, tasks: List[str] = ['NR-AR', 'NR-AR-LBD', 'NR-AhR', 'NR-Aromatase', 'NR-ER', 'NR-ER-LBD', 'NR-PPAR-gamma', 'SR-ARE', 'SR-ATAD5', 'SR-HSE', 'SR-MMP', 'SR-p53'], **kwargs) Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]] [source]¶
Load Tox21 dataset
The “Toxicology in the 21st Century” (Tox21) initiative created a public database measuring toxicity of compounds, which has been used in the 2014 Tox21 Data Challenge. This dataset contains qualitative toxicity measurements for 8k compounds on 12 different targets, including nuclear receptors and stress response pathways.
Random splitting is recommended for this dataset.
The raw data csv file contains columns below:
“smiles” - SMILES representation of the molecular structure
“NR-XXX” - Nuclear receptor signaling bioassay results
“SR-XXX” - Stress response bioassay results
Please refer to https://tripod.nih.gov/tox21/challenge/data.jsp for details.
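The “NR-XXX” / “SR-XXX” naming convention makes it easy to group tasks by assay panel. The sketch below uses the default task names from the load_tox21 signature; the helper `group_by_panel` is a hypothetical name and not a DeepChem function.

```python
from typing import Dict, List

# Default task names, copied from the load_tox21 signature.
TOX21_TASKS = [
    'NR-AR', 'NR-AR-LBD', 'NR-AhR', 'NR-Aromatase', 'NR-ER', 'NR-ER-LBD',
    'NR-PPAR-gamma', 'SR-ARE', 'SR-ATAD5', 'SR-HSE', 'SR-MMP', 'SR-p53',
]


def group_by_panel(tasks: List[str]) -> Dict[str, List[str]]:
    """Group task names by the prefix before the first hyphen ('NR' or 'SR')."""
    groups: Dict[str, List[str]] = {}
    for task in tasks:
        prefix = task.split("-", 1)[0]
        groups.setdefault(prefix, []).append(task)
    return groups


panels = group_by_panel(TOX21_TASKS)
print({prefix: len(names) for prefix, names in panels.items()})
```

A subset selected this way could then be passed as the tasks argument described under Parameters.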
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
tasks (List[str], optional) – Specify the set of tasks to load. If no task is specified, then it loads the default set of tasks, which are NR-AR, NR-AR-LBD, NR-AhR, NR-Aromatase, NR-ER, NR-ER-LBD, NR-PPAR-gamma, SR-ARE, SR-ATAD5, SR-HSE, SR-MMP, SR-p53.
References
- 1
Tox21 Challenge. https://tripod.nih.gov/tox21/challenge/
Toxcast Datasets¶
- load_toxcast(featurizer: Union[deepchem.feat.base_classes.Featurizer, str] = 'ECFP', splitter: Optional[Union[deepchem.splits.splitters.Splitter, str]] = 'scaffold', transformers: List[Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = ['balancing'], reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, **kwargs) Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]] [source]¶
Load Toxcast dataset
ToxCast is an extended data collection from the same initiative as Tox21, providing toxicology data for a large library of compounds based on in vitro high-throughput screening. The processed collection includes qualitative results of over 600 experiments on 8k compounds.
Random splitting is recommended for this dataset.
The raw data csv file contains columns below:
“smiles”: SMILES representation of the molecular structure
“ACEA_T47D_80hr_Negative” ~ “Tanguay_ZF_120hpf_YSE_up”: Bioassay results. Please refer to the section “high-throughput assay information” at https://www.epa.gov/chemical-research/toxicity-forecaster-toxcasttm-data for details.
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
References
- 1
Richard, Ann M., et al. “ToxCast chemical landscape: paving the road to 21st century toxicology.” Chemical research in toxicology 29.8 (2016): 1225-1251.
USPTO Datasets¶
- load_uspto(featurizer: Union[deepchem.feat.base_classes.Featurizer, str] = 'RxnFeaturizer', splitter: Optional[Union[deepchem.splits.splitters.Splitter, str]] = None, transformers: List[Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = [], reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, subset: str = 'MIT', sep_reagent: bool = True, **kwargs) Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]] [source]¶
Load USPTO Datasets.
The USPTO dataset consists of over 1.8 Million organic chemical reactions extracted from US patents and patent applications. The dataset contains the reactions in the form of reaction SMILES, which have the general format: reactant>reagent>product.
MolNet provides the ability to load subsets of the USPTO dataset, namely MIT, STEREO and 50K. The MIT dataset contains around 479K reactions, curated by Jin et al. The STEREO dataset contains around 1 million reactions; it does not have duplicates, and the reactions include stereochemical information. The 50K dataset contains 50,000 reactions and is the benchmark for retrosynthesis predictions. The reactions are additionally classified into 10 reaction classes. The canonicalized version of the dataset used by the loader is the same as that used by Somnath et al.
The loader uses the SpecifiedSplitter to apply the same splits as specified by Schwaller et al. and Dai et al. Custom splitters can also be used. There is a toggle in the loader to skip the source/target transformation needed for seq2seq tasks. There is an additional toggle to load the dataset with the reagents and reactants separated or mixed. Mixing alters the entries in source by replacing the ‘>’ with ‘.’, effectively loading them as a unified SMILES string.
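To make the reaction SMILES format and the separated/mixed toggle concrete, here is a minimal sketch of the string handling involved. It is not the loader's actual implementation: the function name `to_source_target` is hypothetical, the example reaction is illustrative, and dropping an empty reagent field when mixing is a choice made for this sketch.

```python
from typing import Tuple


def to_source_target(reaction_smiles: str,
                     sep_reagent: bool = True) -> Tuple[str, str]:
    """Split 'reactant>reagent>product' into a (source, target) pair."""
    reactants, reagents, products = reaction_smiles.split(">")
    if sep_reagent:
        # Separated: keep the '>' between reactants and reagents.
        source = f"{reactants}>{reagents}"
    else:
        # Mixed: join reactants and reagents into one dot-separated SMILES,
        # skipping the reagent field if it is empty.
        source = ".".join(part for part in (reactants, reagents) if part)
    return source, products


rxn = "CC(=O)O.OCC>[H+]>CC(=O)OCC"  # illustrative reaction SMILES
print(to_source_target(rxn, sep_reagent=True))
print(to_source_target(rxn, sep_reagent=False))
```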
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
subset (str (default 'MIT')) – Subset of dataset to download. ‘FULL’, ‘MIT’, ‘STEREO’, and ‘50K’ are supported.
sep_reagent (bool (default True)) – Toggle to load dataset with reactants and reagents either separated or mixed.
skip_transform (bool (default True)) – Toggle to skip the source/target transformation.
- Returns
tasks, datasets, transformers –
- tasks: list
Column names corresponding to machine learning target variables.
- datasets: tuple
train, validation, test splits of data as deepchem.data.datasets.Dataset instances.
- transformers: list
deepchem.trans.transformers.Transformer instances applied to dataset.
- Return type
tuple
References
- 1
Lowe, D. Chemical reactions from US patents (1976-Sep2016) (Version 1). figshare (2017). https://doi.org/10.6084/m9.figshare.5104873.v1
- 2
Somnath, Vignesh Ram, et al. “Learning graph models for retrosynthesis prediction.” arXiv preprint arXiv:2006.07038 (2020).
- 3
Schwaller, Philippe, et al. “Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction.” ACS central science 5.9 (2019): 1572-1583.
- 4
Dai, Hanjun, et al. “Retrosynthesis prediction with conditional graph logic network.” arXiv preprint arXiv:2001.01408 (2020).
UV Datasets¶
- load_uv(shard_size=2000, featurizer=None, split=None, reload=True)[source]¶
Load UV dataset; does not do train/test split
The UV dataset is an in-house dataset from Merck that was first introduced in the following paper: Ramsundar, Bharath, et al. “Is multitask deep learning practical for pharma?.” Journal of chemical information and modeling 57.8 (2017): 2068-2076.
The UV dataset tests 10,000 of Merck’s internal compounds on 190 absorption wavelengths between 210 and 400 nm. Unlike most of the other datasets featured in MoleculeNet, the UV collection does not have structures for the compounds tested since they were proprietary Merck compounds. However, the collection does feature pre-computed descriptors for these compounds.
Note that the original train/valid/test split from the source data was preserved here, so this function doesn’t allow for alternate modes of splitting. Similarly, since the source data came pre-featurized, it is not possible to apply alternative featurizations.
- Parameters
shard_size (int, optional) – Size of the DiskDataset shards to write on disk
featurizer (optional) – Ignored since featurization pre-computed
split (optional) – Ignored since split pre-computed
reload (bool, optional) – Whether to automatically re-load from disk
ZINC15 Datasets¶
- load_zinc15(featurizer: Union[deepchem.feat.base_classes.Featurizer, str] = 'OneHot', splitter: Optional[Union[deepchem.splits.splitters.Splitter, str]] = 'random', transformers: List[Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = ['normalization'], reload: bool = True, data_dir: Optional[str] = None, save_dir: Optional[str] = None, dataset_size: str = '250K', dataset_dimension: str = '2D', tasks: List[str] = ['mwt', 'logp', 'reactive'], **kwargs) Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]] [source]¶
Load zinc15.
ZINC15 is a dataset of over 230 million purchasable compounds for virtual screening of small molecules to identify structures that are likely to bind to drug targets. ZINC15 data is currently available in 2D (SMILES string) format.
MolNet provides subsets of 250K, 1M, and 10M “lead-like” compounds from ZINC15. The full dataset of 270M “goldilocks” compounds is also available. Compounds in ZINC15 are labeled by their molecular weight and LogP (solubility) values. Each compound also has information about how readily available (purchasable) it is and its reactivity. Lead-like compounds have molecular weight between 300 and 350 Daltons and LogP between -1 and 3.5. Goldilocks compounds are lead-like compounds with LogP values further restricted to between 2 and 3.
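The lead-like and goldilocks definitions above amount to simple range checks, sketched below. The function name `classify_compound` is hypothetical, and treating the bounds as inclusive is an assumption, since the text does not say whether the endpoints are included.

```python
def classify_compound(mol_weight: float, logp: float) -> str:
    """Classify a compound using the molecular weight and LogP ranges above."""
    # Lead-like: molecular weight 300-350 Da and LogP between -1 and 3.5.
    lead_like = 300 <= mol_weight <= 350 and -1 <= logp <= 3.5
    if not lead_like:
        return "other"
    # Goldilocks: lead-like with LogP further restricted to 2-3.
    if 2 <= logp <= 3:
        return "goldilocks"
    return "lead-like"


for weight, logp in [(320.0, 2.5), (320.0, 0.0), (400.0, 2.5)]:
    print(weight, logp, classify_compound(weight, logp))
```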
If reload = True and data_dir (save_dir) is specified, the loader will attempt to load the raw dataset (featurized dataset) from disk. Otherwise, the dataset will be downloaded from the DeepChem AWS bucket.
For more information on ZINC15, please see [1] and https://zinc15.docking.org/.
- Parameters
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a TransformerGenerator or, as a shortcut, one of the names from dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
dataset_size (str (default '250K')) – Size of dataset to download. ‘250K’, ‘1M’, ‘10M’, and ‘270M’ are supported.
dataset_dimension (str (default '2D')) – Format of data to download. 2D SMILES strings or 3D SDF files.
tasks (List[str], optional, default: [‘mwt’, ‘logp’, ‘reactive’]) – Specify the set of tasks to load. If no task is specified, then it loads the default set of tasks, which are mwt, logp, and reactive.
- Returns
tasks, datasets, transformers –
- tasks: list
Column names corresponding to machine learning target variables.
- datasets: tuple
train, validation, test splits of data as deepchem.data.datasets.Dataset instances.
- transformers: list
deepchem.trans.transformers.Transformer instances applied to dataset.
- Return type
tuple
Notes
The total ZINC dataset with SMILES strings contains hundreds of millions of compounds and is over 100GB! ZINC250K is recommended for experimentation. The full set of 270M goldilocks compounds is 23GB.
References
- 1
Sterling and Irwin. J. Chem. Inf. Model, 2015 http://pubs.acs.org/doi/abs/10.1021/acs.jcim.5b00559.
Platinum Adsorption Dataset¶
- load_Platinum_Adsorption(featurizer: typing.Union[deepchem.feat.base_classes.Featurizer, str] = SineCoulombMatrix[max_atoms=100, flatten=True], splitter: typing.Optional[typing.Union[deepchem.splits.splitters.Splitter, str]] = 'random', transformers: typing.List[typing.Union[deepchem.molnet.load_function.molnet_loader.TransformerGenerator, str]] = [], reload: bool = True, data_dir: typing.Optional[str] = None, save_dir: typing.Optional[str] = None, **kwargs) Tuple[List[str], Tuple[deepchem.data.datasets.Dataset, ...], List[transformers.Transformer]] [source]¶
Load Platinum Adsorption Dataset
The dataset consists of different configurations of adsorbates (i.e., N and NO) on a platinum surface, represented as a lattice, together with their formation energies. There are 648 different adsorbate configurations in this dataset, represented as Pymatgen Structure objects.
- Pymatgen Structure object with site_properties containing the following keys:
- “SiteTypes”, indicating whether a site is an active site “A1” or a spectator site “S1”.
- “oss”, the different occupational sites. For spectator sites, set it to -1.
- Parameters
featurizer (Featurizer (default LCNNFeaturizer)) – the featurizer to use for processing the data. Using the LCNNFeaturizer is recommended.
splitter (Splitter (default RandomSplitter)) – the splitter to use for splitting the data into training, validation, and test sets. Alternatively you can pass one of the names from dc.molnet.splitters as a shortcut. If this is None, all the data will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data, chosen to suit the featurizer. No transformation is required for the LCNNFeaturizer.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str, optional (default None)) – a directory to save the dataset in
References
- 1
Jonathan Lym, Geun Ho G. “Lattice Convolutional Neural Network Modeling of Adsorbate Coverage Effects” J. Phys. Chem. C 2019, 123, 18951−18959
Examples
>>> import deepchem as dc
>>> # data_path and feat_args are assumed to be defined by the caller
>>> tasks, datasets, transformers = dc.molnet.load_Platinum_Adsorption(
...     reload=True,
...     data_dir=data_path,
...     save_dir=data_path,
...     featurizer_kwargs=feat_args)
>>> train_dataset, val_dataset, test_dataset = datasets