# Examples¶

Before jumping in to examples, we’ll import our libraries and ensure our doctests are reproducible:

```
>>> import numpy as np
>>> import tensorflow as tf
>>> import deepchem as dc
>>>
>>> # Run before every test for reproducibility
>>> def seed_all():
... np.random.seed(123)
... tf.random.set_seed(123)
>>>
```

Other notes:

- We match against doctest’s
`...`

wildcard on code where output is usually ignored - We often use threshold assertions (e.g:
`score['mean-pearson_r2_score'] > 0.92`

), as this is what matters for model training code.

## SAMPL (FreeSolv)¶

Examples of training models on the SAMPL(FreeSolv) dataset included in MoleculeNet.

We’ll be using its `smiles`

field to train models to predict its experimentally measured solvation energy (`expt`

).

### MultitaskRegressor¶

First, we’ll load the dataset with `load_sampl()`

and fit a `MultitaskRegressor`

:

```
>>> seed_all()
>>> # Load SAMPL dataset with default 'index' splitting
>>> SAMPL_tasks, SAMPL_datasets, transformers = dc.molnet.load_sampl()
>>> SAMPL_tasks
['expt']
>>> train_dataset, valid_dataset, test_dataset = SAMPL_datasets
>>>
>>> # We want to know the pearson R squared score, averaged across tasks
>>> avg_pearson_r2 = dc.metrics.Metric(dc.metrics.pearson_r2_score, np.mean)
>>>
>>> # We'll train a multitask regressor (fully connected network)
>>> model = dc.models.MultitaskRegressor(
... len(SAMPL_tasks),
... n_features = 1024,
... layer_sizes=[1000],
... dropouts=[.25],
... learning_rate=0.001,
... batch_size=50)
>>>
>>> model.fit(train_dataset)
0...
>>>
>>> # We now evaluate our fitted model on our training and validation sets
>>> train_scores = model.evaluate(train_dataset, [avg_pearson_r2], transformers)
>>> assert train_scores['mean-pearson_r2_score'] > 0.9, train_scores
>>>
>>> valid_scores = model.evaluate(valid_dataset, [avg_pearson_r2], transformers)
>>> assert valid_scores['mean-pearson_r2_score'] > 0.7, valid_scores
```

### GraphConvModel¶

The default featurizer for SAMPL is `ECFP`

, short for
“Extended-connectivity fingerprints.”
For a `GraphConvModel`

, we’ll reload our datasets with `featurizer='GraphConv'`

:

```
>>> seed_all()
>>> # Load SAMPL dataset
>>> SAMPL_tasks, SAMPL_datasets, transformers = dc.molnet.load_sampl(
... featurizer='GraphConv')
>>> train_dataset, valid_dataset, test_dataset = SAMPL_datasets
>>>
>>> model = dc.models.GraphConvModel(len(SAMPL_tasks), mode='regression')
>>>
>>> model.fit(train_dataset, nb_epoch=20)
0...
>>>
>>> # We now evaluate our fitted model on our training and validation sets
>>> train_scores = model.evaluate(train_dataset, [avg_pearson_r2], transformers)
>>> assert train_scores['mean-pearson_r2_score'] > 0.5, train_scores
>>>
>>> valid_scores = model.evaluate(valid_dataset, [avg_pearson_r2], transformers)
>>> assert valid_scores['mean-pearson_r2_score'] > 0.3, valid_scores
```

## ChEMBL¶

Examples of training models on ChEMBL <https://www.ebi.ac.uk/chembl/> dataset included in MoleculeNet.

ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs.

### MultitaskRegressor¶

```
>>> seed_all()
>>> # Load ChEMBL 5thresh dataset with random splitting
>>> chembl_tasks, datasets, transformers = dc.molnet.load_chembl(
... shard_size=2000, featurizer="ECFP", set="5thresh", split="random")
>>> train_dataset, valid_dataset, test_dataset = datasets
>>> len(chembl_tasks)
691
>>> f'Compound train/valid/test split: {len(train_dataset)}/{len(valid_dataset)}/{len(test_dataset)}'
'Compound train/valid/test split: 19096/2387/2388'
>>>
>>> # We want to know the RMS, averaged across tasks
>>> avg_rms = dc.metrics.Metric(dc.metrics.rms_score, np.mean)
>>>
>>> # Create our model
>>> n_layers = 3
>>> model = dc.models.MultitaskRegressor(
... len(chembl_tasks),
... n_features=1024,
... layer_sizes=[1000] * n_layers,
... dropouts=[.25] * n_layers,
... weight_init_stddevs=[.02] * n_layers,
... bias_init_consts=[1.] * n_layers,
... learning_rate=.0003,
... weight_decay_penalty=.0001,
... batch_size=100)
>>>
>>> model.fit(train_dataset, nb_epoch=5)
0...
>>>
>>> # We now evaluate our fitted model on our training and validation sets
>>> train_scores = model.evaluate(train_dataset, [avg_rms], transformers)
>>> assert train_scores['mean-rms_score'] < 10.00
>>>
>>> valid_scores = model.evaluate(valid_dataset, [avg_rms], transformers)
>>> assert valid_scores['mean-rms_score'] < 10.00
```

### GraphConvModel¶

```
>>> # Load ChEMBL dataset
>>> chembl_tasks, datasets, transformers = dc.molnet.load_chembl(
... shard_size=2000, featurizer="GraphConv", set="5thresh", split="random")
>>> train_dataset, valid_dataset, test_dataset = datasets
>>>
>>> # RMS, averaged across tasks
>>> avg_rms = dc.metrics.Metric(dc.metrics.rms_score, np.mean)
>>>
>>> model = dc.models.GraphConvModel(
... len(chembl_tasks), batch_size=128, mode='regression')
>>>
>>> # Fit trained model
>>> model.fit(train_dataset, nb_epoch=5)
0...
>>>
>>> # We now evaluate our fitted model on our training and validation sets
>>> train_scores = model.evaluate(train_dataset, [avg_rms], transformers)
>>> assert train_scores['mean-rms_score'] < 10.00
>>>
>>> valid_scores = model.evaluate(valid_dataset, [avg_rms], transformers)
>>> assert valid_scores['mean-rms_score'] < 10.00
```