Utilities

DeepChem has a broad collection of utility functions. Many of these may be of independent interest to users since they deal with some tricky aspects of processing scientific datatypes.

Data Utilities

Array Utilities

deepchem.utils.data_utils.pad_array(x: numpy.ndarray, shape: Union[Tuple, int], fill: float = 0.0, both: bool = False) → numpy.ndarray[source]

Pad an array with a fill value.

Parameters:
  • x (np.ndarray) – A numpy array.
  • shape (Tuple or int) – Desired shape. If int, all dimensions are padded to that size.
  • fill (float, optional (default 0.0)) – The value used for padding.
  • both (bool, optional (default False)) – If True, split the padding on both sides of each axis. If False, padding is applied to the end of each axis.
Returns:

A padded numpy array

Return type:

np.ndarray
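
The effect of the both flag is easiest to see on a single axis. The following is a dependency-free, one-dimensional sketch of the two padding modes rather than DeepChem's implementation; pad_1d is a hypothetical helper, and the way an odd amount of padding is split between the two sides is an assumption.

```python
from typing import List


def pad_1d(values: List[float], size: int, fill: float = 0.0,
           both: bool = False) -> List[float]:
    """Pad a 1-D list to `size` entries (hypothetical helper)."""
    diff = size - len(values)
    if diff < 0:
        raise ValueError("target size is smaller than the input")
    if both:
        # Assumption: when diff is odd, the extra element goes at the end.
        before = diff // 2
        after = diff - before
    else:
        # All padding is appended at the end of the axis.
        before, after = 0, diff
    return [fill] * before + list(values) + [fill] * after


print(pad_1d([1.0, 2.0], 5))             # [1.0, 2.0, 0.0, 0.0, 0.0]
print(pad_1d([1.0, 2.0], 5, both=True))  # [0.0, 1.0, 2.0, 0.0, 0.0]
```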

Data Directory

The DeepChem data directory is where downloaded MoleculeNet datasets are stored.

deepchem.utils.data_utils.get_data_dir() → str[source]

Get the DeepChem data directory.

Returns:The default path used to store DeepChem data. To change it, set the DEEPCHEM_DATA_DIR environment variable to your preferred path.
Return type:str
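
The environment-variable override can be pictured as follows. This is a minimal sketch, not the library code: resolve_data_dir is a hypothetical name, and falling back to the system temporary directory when DEEPCHEM_DATA_DIR is unset is an assumption.

```python
import os
import tempfile


def resolve_data_dir() -> str:
    """Return the directory named by DEEPCHEM_DATA_DIR, if it is set.

    Assumption: the system temporary directory is the fallback.
    """
    return os.environ.get("DEEPCHEM_DATA_DIR", tempfile.gettempdir())


# Point the (hypothetical) resolver at a custom location.
os.environ["DEEPCHEM_DATA_DIR"] = os.path.join(tempfile.gettempdir(), "my_deepchem_data")
print(resolve_data_dir())
```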

URL Handling

deepchem.utils.data_utils.download_url(url: str, dest_dir: str = '/tmp', name: Optional[str] = None)[source]

Download a file to disk.

Parameters:
  • url (str) – The URL to download from
  • dest_dir (str) – The directory to save the file in
  • name (str) – The file name to save it as. If omitted, it will try to extract a file name from the URL
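
When name is omitted, the file name has to be derived from the URL. One plausible way to do this with the standard library is sketched below; destination_path is a hypothetical helper, the example URL is made up, and the exact rule DeepChem applies may differ.

```python
import os
from typing import Optional
from urllib.parse import urlparse


def destination_path(url: str, dest_dir: str = "/tmp",
                     name: Optional[str] = None) -> str:
    """Build the local path for a download (hypothetical helper)."""
    if name is None:
        # Ignore any query string or fragment, then keep the last path segment.
        name = os.path.basename(urlparse(url).path)
    return os.path.join(dest_dir, name)


# On a POSIX system this prints /tmp/data.csv.gz
print(destination_path("https://example.com/files/data.csv.gz?download=1"))
```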

File Handling

deepchem.utils.data_utils.untargz_file(file: str, dest_dir: str = '/tmp', name: Optional[str] = None)[source]

Untar and unzip a .tar.gz file to disk.

Parameters:
  • file (str) – The filepath to decompress
  • dest_dir (str) – The directory to save the file in
  • name (str) – The file name to save it as. If omitted, it will use the file name
deepchem.utils.data_utils.unzip_file(file: str, dest_dir: str = '/tmp', name: Optional[str] = None)[source]

Unzip a .zip file to disk.

Parameters:
  • file (str) – The filepath to decompress
  • dest_dir (str) – The directory to save the file in
  • name (str) – The directory name to unzip it to. If omitted, it will use the file name
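
For readers who want to see what untarring and unzipping amount to, here is a self-contained sketch using only the standard tarfile and zipfile modules. It builds two tiny archives in a temporary directory and unpacks them; it illustrates the operations and is not a description of how the functions above are implemented.

```python
import os
import tarfile
import tempfile
import zipfile

work_dir = tempfile.mkdtemp()

# Create a tiny text file and pack it both ways.
sample = os.path.join(work_dir, "sample.txt")
with open(sample, "w") as handle:
    handle.write("hello")

tar_path = os.path.join(work_dir, "sample.tar.gz")
with tarfile.open(tar_path, "w:gz") as archive:
    archive.add(sample, arcname="sample.txt")

zip_path = os.path.join(work_dir, "sample.zip")
with zipfile.ZipFile(zip_path, "w") as archive:
    archive.write(sample, arcname="sample.txt")

# Unpack each archive into its own destination directory.
tar_dest = os.path.join(work_dir, "from_tar")
os.makedirs(tar_dest, exist_ok=True)
with tarfile.open(tar_path, "r:gz") as archive:
    archive.extractall(tar_dest)

zip_dest = os.path.join(work_dir, "from_zip")
os.makedirs(zip_dest, exist_ok=True)
with zipfile.ZipFile(zip_path) as archive:
    archive.extractall(zip_dest)

print(sorted(os.listdir(tar_dest)), sorted(os.listdir(zip_dest)))
```
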
deepchem.utils.data_utils.load_data(input_files: List[str], shard_size: Optional[int] = None) → Iterator[Any][source]

Loads data from files.

Parameters:
  • input_files (List[str]) – List of filenames.
  • shard_size (int, default None) – Size of shard to yield
Returns:

Iterator which iterates over provided files.

Return type:

Iterator[Any]

Notes

The supported file types are SDF, CSV and Pickle.

deepchem.utils.data_utils.load_sdf_files(input_files: List[str], clean_mols: bool = True, tasks: List[str] = [], shard_size: Optional[int] = None) → Iterator[pandas.core.frame.DataFrame][source]

Load SDF file into dataframe.

Parameters:
  • input_files (List[str]) – List of filenames
  • clean_mols (bool, default True) – Whether to sanitize molecules.
  • tasks (List[str], default []) – Each entry in tasks is treated as a property in the SDF file and is retrieved with mol.GetProp(str(task)) where mol is the RDKit mol loaded from a given SDF entry.
  • shard_size (int, default None) – The shard size to yield at one time.
Returns:

Generator that yields dataframes, one shard of shard_size rows at a time.

Return type:

Iterator[pd.DataFrame]

Notes

This function requires RDKit to be installed.

deepchem.utils.data_utils.load_csv_files(input_files: List[str], shard_size: Optional[int] = None) → Iterator[pandas.core.frame.DataFrame][source]

Load data as pandas dataframe from CSV files.

Parameters:
  • input_files (List[str]) – List of filenames
  • shard_size (int, default None) – The shard size to yield at one time.
Returns:

Generator that yields dataframes, one shard of shard_size rows at a time.

Return type:

Iterator[pd.DataFrame]
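
The idea of yielding a file in shards can be sketched without pandas. The example below is an illustration under stated assumptions: iter_csv_shards is a hypothetical helper that yields lists of row dictionaries rather than dataframes, the last shard is allowed to be smaller than shard_size, and shard_size=None is taken to mean a single shard for the whole file.

```python
import csv
import io
from typing import Dict, Iterator, List, Optional, TextIO


def iter_csv_shards(handle: TextIO,
                    shard_size: Optional[int] = None
                    ) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of rows holding at most `shard_size` rows each."""
    shard: List[Dict[str, str]] = []
    for row in csv.DictReader(handle):
        shard.append(row)
        if shard_size is not None and len(shard) == shard_size:
            yield shard
            shard = []
    if shard:
        # The final shard may hold fewer than shard_size rows.
        yield shard


text = "smiles,label\nC,0\nCC,1\nCCC,0\nCCCC,1\nCCCCC,0\n"
sizes = [len(shard) for shard in iter_csv_shards(io.StringIO(text), shard_size=2)]
print(sizes)  # [2, 2, 1]
```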

deepchem.utils.data_utils.load_json_files(input_files: List[str], shard_size: Optional[int] = None) → Iterator[pandas.core.frame.DataFrame][source]

Load data as pandas dataframe.

Parameters:
  • input_files (List[str]) – List of json filenames.
  • shard_size (int, default None) – Chunksize for reading json files.
Returns:

Generator that yields dataframes, one shard of shard_size rows at a time.

Return type:

Iterator[pd.DataFrame]

Notes

To load shards from a json file into a Pandas dataframe, the file must have been saved with df.to_json('filename.json', orient='records', lines=True).
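
The records-per-line layout is what makes chunked reading possible: each line is a complete JSON object, so a reader can stop after any number of lines. The sketch below shows this with the standard json module only; iter_json_shards is a hypothetical helper, the records are made up, and lists of dictionaries stand in for dataframes.

```python
import io
import json
from itertools import islice
from typing import Any, Dict, Iterator, List, TextIO

records = [{"smiles": "C", "y": 0.1}, {"smiles": "CC", "y": 0.2},
           {"smiles": "CCC", "y": 0.3}]

# One JSON object per line, as written by orient='records', lines=True.
text = "\n".join(json.dumps(record) for record in records) + "\n"


def iter_json_shards(handle: TextIO, shard_size: int
                     ) -> Iterator[List[Dict[str, Any]]]:
    """Parse a line-delimited JSON stream `shard_size` lines at a time."""
    while True:
        lines = list(islice(handle, shard_size))
        if not lines:
            return
        yield [json.loads(line) for line in lines]


shards = list(iter_json_shards(io.StringIO(text), shard_size=2))
print([len(shard) for shard in shards])  # [2, 1]
```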

deepchem.utils.data_utils.load_pickle_files(input_files: List[str]) → Iterator[Any][source]

Load dataset from pickle file.

Parameters:input_files (List[str]) – The list of pickle filenames. Gzipped pickle files such as XXXX.pkl.gz can also be loaded.
Returns:Generator which yields the object loaded from each pickle file.
Return type:Iterator[Any]
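
Reading plain and gzipped pickle files through one generator could look like the sketch below. iter_pickles is a hypothetical helper, and choosing the opener from the .gz suffix is an assumption about how compressed files are detected. As with any use of pickle, only load files from sources you trust.

```python
import gzip
import os
import pickle
import tempfile
from typing import Any, Iterator, List


def iter_pickles(filenames: List[str]) -> Iterator[Any]:
    """Yield one unpickled object per file, handling .gz files as well."""
    for filename in filenames:
        # Assumption: compressed files are recognised by their suffix.
        opener = gzip.open if filename.endswith(".gz") else open
        with opener(filename, "rb") as handle:
            yield pickle.load(handle)


work_dir = tempfile.mkdtemp()
plain = os.path.join(work_dir, "a.pkl")
zipped = os.path.join(work_dir, "b.pkl.gz")
with open(plain, "wb") as handle:
    pickle.dump({"id": 1}, handle)
with gzip.open(zipped, "wb") as handle:
    pickle.dump({"id": 2}, handle)

print(list(iter_pickles([plain, zipped])))  # [{'id': 1}, {'id': 2}]
```
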
deepchem.utils.data_utils.load_from_disk(filename: str) → Any[source]

Load a dataset from file.

Parameters:filename (str) – The name of the file to load data from.
Returns:A loaded object from file.
Return type:Any
deepchem.utils.data_utils.save_to_disk(dataset: Any, filename: str, compress: int = 3)[source]

Save a dataset to file.

Parameters:
  • dataset (Any) – The data to save.
  • filename (str) – Path to save data.
  • compress (int, default 3) – The compress option when dumping joblib file.
deepchem.utils.data_utils.load_dataset_from_disk(save_dir: str) → Tuple[bool, Optional[Tuple[deepchem.data.datasets.DiskDataset, deepchem.data.datasets.DiskDataset, deepchem.data.datasets.DiskDataset]], List[transformers.Transformer]][source]

Loads MoleculeNet train/valid/test/transformers from disk.

Expects that data was saved using save_dataset_to_disk below. Expects the following directory structure for save_dir:

save_dir/
  |
  ---> train_dir/
  |
  ---> valid_dir/
  |
  ---> test_dir/
  |
  ---> transformers.pkl

Parameters:save_dir (str) – Directory name to load datasets.
Returns:
  • loaded (bool) – Whether the load succeeded
  • all_dataset (Optional[Tuple[DiskDataset, DiskDataset, DiskDataset]]) – The train, valid, test datasets
  • transformers (List[Transformer]) – The transformers used for this dataset
deepchem.utils.data_utils.save_dataset_to_disk(save_dir: str, train: deepchem.data.datasets.DiskDataset, valid: deepchem.data.datasets.DiskDataset, test: deepchem.data.datasets.DiskDataset, transformers: List[dc.trans.Transformer])[source]

Utility used by MoleculeNet to save train/valid/test datasets.

This utility function saves a train/valid/test split of a dataset along with transformers in the same directory. The saved datasets will take the following structure:

save_dir/
  |
  ---> train_dir/
  |
  ---> valid_dir/
  |
  ---> test_dir/
  |
  ---> transformers.pkl

Parameters:
  • save_dir (str) – Directory name to save datasets to.
  • train (DiskDataset) – Training dataset to save.
  • valid (DiskDataset) – Validation dataset to save.
  • test (DiskDataset) – Test dataset to save.
  • transformers (List[Transformer]) – List of transformers to save to disk.

Molecular Utilities

class deepchem.utils.conformers.ConformerGenerator(max_conformers: int = 1, rmsd_threshold: float = 0.5, force_field: str = 'uff', pool_multiplier: int = 10)[source]

Generate molecule conformers.

Notes

Procedure

  1. Generate a pool of conformers.
  2. Minimize conformers.
  3. Prune conformers using an RMSD threshold.

Note that pruning is done _after_ minimization, which differs from the protocol described in the references [1] [2].

References

[1] http://rdkit.org/docs/GettingStartedInPython.html#working-with-3d-molecules
[2] http://pubs.acs.org/doi/full/10.1021/ci2004658

Notes

This class requires RDKit to be installed.

__init__(max_conformers: int = 1, rmsd_threshold: float = 0.5, force_field: str = 'uff', pool_multiplier: int = 10)[source]
Parameters:
  • max_conformers (int, optional (default 1)) – Maximum number of conformers to generate (after pruning).
  • rmsd_threshold (float, optional (default 0.5)) – RMSD threshold for pruning conformers. If None or negative, no pruning is performed.
  • force_field (str, optional (default 'uff')) – Force field to use for conformer energy calculation and minimization. Options are ‘uff’, ‘mmff94’, and ‘mmff94s’.
  • pool_multiplier (int, optional (default 10)) – Factor to multiply by max_conformers to generate the initial conformer pool. Since conformers are pruned after energy minimization, increasing the size of the pool increases the chance of identifying max_conformers unique conformers.
embed_molecule(mol: Any) → Any[source]

Generate conformers, possibly with pruning.

Parameters:mol (rdkit.Chem.rdchem.Mol) – RDKit Mol object
Returns:mol – RDKit Mol object with embedded multiple conformers.
Return type:rdkit.Chem.rdchem.Mol
generate_conformers(mol: Any) → Any[source]

Generate conformers for a molecule.

This function returns a copy of the original molecule with embedded conformers.

Parameters:mol (rdkit.Chem.rdchem.Mol) – RDKit Mol object
Returns:mol – A new RDKit Mol object containing the chosen conformers, sorted by increasing energy.
Return type:rdkit.Chem.rdchem.Mol
get_conformer_energies(mol: Any) → numpy.ndarray[source]

Calculate conformer energies.

Parameters:mol (rdkit.Chem.rdchem.Mol) – RDKit Mol object with embedded conformers.
Returns:energies – Minimized conformer energies.
Return type:np.ndarray
static get_conformer_rmsd(mol: Any) → numpy.ndarray[source]

Calculate conformer-conformer RMSD.

Parameters:mol (rdkit.Chem.rdchem.Mol) – RDKit Mol object
Returns:rmsd – A matrix of pairwise conformer-conformer RMSD values. The shape is (NumConformers, NumConformers).
Return type:np.ndarray
get_molecule_force_field(mol: Any, conf_id: Optional[int] = None, **kwargs) → Any[source]

Get a force field for a molecule.

Parameters:
  • mol (rdkit.Chem.rdchem.Mol) – RDKit Mol object with embedded conformers.
  • conf_id (int, optional) – ID of the conformer to associate with the force field.
  • kwargs (dict, optional) – Keyword arguments for force field constructor.
Returns:

ff – RDKit force field instance for a molecule.

Return type:

rdkit.ForceField.rdForceField.ForceField

minimize_conformers(mol: Any) → None[source]

Minimize molecule conformers.

Parameters:mol (rdkit.Chem.rdchem.Mol) – RDKit Mol object with embedded conformers.
prune_conformers(mol: Any) → Any[source]

Prune conformers from a molecule using an RMSD threshold, starting with the lowest energy conformer.

Parameters:mol (rdkit.Chem.rdchem.Mol) – RDKit Mol object
Returns:new_mol – A new rdkit.Chem.rdchem.Mol containing the chosen conformers, sorted by increasing energy.
Return type:rdkit.Chem.rdchem.Mol
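
The pruning step can be illustrated independently of RDKit. The sketch below works on a plain list of energies and a precomputed RMSD matrix; prune_by_rmsd is a hypothetical helper, it assumes a non-negative threshold, and it returns conformer indices instead of a new molecule.

```python
from typing import List, Sequence


def prune_by_rmsd(energies: Sequence[float],
                  rmsd: Sequence[Sequence[float]],
                  threshold: float = 0.5,
                  max_conformers: int = 1) -> List[int]:
    """Return indices of the kept conformers, sorted by increasing energy."""
    # Visit conformers from lowest to highest energy.
    order = sorted(range(len(energies)), key=lambda i: energies[i])
    keep: List[int] = []
    for i in order:
        if len(keep) >= max_conformers:
            break
        # Keep a conformer only if it is far enough from every kept one.
        if all(rmsd[i][j] >= threshold for j in keep):
            keep.append(i)
    return keep


energies = [3.0, 1.0, 2.0]
rmsd = [[0.0, 0.2, 0.9],
        [0.2, 0.0, 0.8],
        [0.9, 0.8, 0.0]]
print(prune_by_rmsd(energies, rmsd, threshold=0.5, max_conformers=3))  # [1, 2]
```

Conformer 0 is dropped because it lies within 0.5 of the lowest-energy conformer 1.
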
class deepchem.utils.rdkit_utils.MoleculeLoadException(*args, **kwargs)[source]
__init__(*args, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

deepchem.utils.rdkit_utils.get_xyz_from_mol(mol)[source]

Extracts a numpy array of coordinates from a molecule.

Returns a (N, 3) numpy array of 3d coords of given rdkit molecule

Parameters:mol (rdkit Molecule) – Molecule to extract coordinates for
Returns:A numpy array of shape (N, 3), where N = mol.GetNumAtoms().
Return type:np.ndarray
deepchem.utils.rdkit_utils.add_hydrogens_to_mol(mol, is_protein=False)[source]

Add hydrogens to a molecule object

Parameters:
  • mol (Rdkit Mol) – Molecule to hydrogenate
  • is_protein (bool, optional (default False)) – Whether this molecule is a protein.
Returns:

The molecule with hydrogens added.

Return type:

Rdkit Mol

Note

This function requires RDKit and PDBFixer to be installed.

deepchem.utils.rdkit_utils.compute_charges(mol)[source]

Attempt to compute Gasteiger charges on a molecule.

The charges are stored on mol as a side effect. The mol passed into this function must already have been sanitized.

Parameters:mol (rdkit molecule) – The sanitized molecule to compute charges for.
Returns:None. The molecule is updated in place.

Note

This function requires RDKit to be installed.

deepchem.utils.rdkit_utils.load_molecule(molecule_file, add_hydrogens=True, calc_charges=True, sanitize=True, is_protein=False)[source]

Converts molecule file to (xyz-coords, rdkit mol object)

Given molecule_file, returns a tuple of xyz coords of molecule and an rdkit object representing that molecule in that order (xyz, rdkit_mol). This ordering convention is used in the code in a few places.

Parameters:
  • molecule_file (str) – filename for molecule
  • add_hydrogens (bool, optional (default True)) – If True, add hydrogens via pdbfixer
  • calc_charges (bool, optional (default True)) – If True, add charges via rdkit
  • sanitize (bool, optional (default True)) – If True, sanitize molecules via rdkit
  • is_protein (bool, optional (default False)) – If True, this molecule is loaded as a protein. This flag will affect some of the cleanup procedures applied.
Returns:

Tuple (xyz, mol) if the file contains a single molecule. Otherwise, a list of such tuples, one per molecule in the file.

Note

This function requires RDKit to be installed.

deepchem.utils.rdkit_utils.write_molecule(mol, outfile, is_protein=False)[source]

Write molecule to a file

This function writes a representation of the provided molecule to the specified outfile. Doesn’t return anything.

Parameters:
  • mol (rdkit Mol) – Molecule to write
  • outfile (str) – Filename to write mol to
  • is_protein (bool, optional) – Is this molecule a protein?

Note

This function requires RDKit to be installed.

Raises:ValueError: if outfile isn’t of a supported format.

Molecular Fragment Utilities

It’s often convenient to manipulate subsets of a molecule. The MolecularFragment class aids in such manipulations.

class deepchem.utils.fragment_utils.MolecularFragment(atoms: Sequence[Any], coords: numpy.ndarray)[source]

A class that represents a fragment of a molecule.

It’s often convenient to represent a fragment of a molecule. For example, if two molecules form a molecular complex, it may be useful to create two fragments which represent the subset of each molecule that is close to the other molecule (in the contact region).

Ideally, we’d be able to do this directly in RDKit, but manipulating molecular fragments doesn’t seem to be supported functionality.

Examples

>>> import numpy as np
>>> from rdkit import Chem
>>> from deepchem.utils.fragment_utils import MolecularFragment
>>> mol = Chem.MolFromSmiles("C")
>>> coords = np.array([[0.0, 0.0, 0.0]])
>>> atom = mol.GetAtoms()[0]
>>> fragment = MolecularFragment([atom], coords)
GetAtoms() → List[deepchem.utils.fragment_utils.AtomShim][source]

Returns the list of atoms

Returns:list of atoms in this fragment.
Return type:List[AtomShim]
GetCoords() → numpy.ndarray[source]

Returns 3D coordinates for this fragment as numpy array.

Returns:A numpy array of shape (N, 3) with coordinates for this fragment. Here, N is the number of atoms.
Return type:np.ndarray
__init__(atoms: Sequence[Any], coords: numpy.ndarray)[source]

Initialize this object.

Parameters:
  • atoms (Iterable[rdkit.Chem.rdchem.Atom]) – Each entry in this list should be a RDKit Atom.
  • coords (np.ndarray) – Array of locations for atoms of shape (N, 3) where N == len(atoms).
class deepchem.utils.fragment_utils.AtomShim(atomic_num: int, partial_charge: float, atom_coords: numpy.ndarray)[source]

This is a shim object wrapping an atom.

We use this class instead of raw RDKit atoms since manipulating a large number of rdkit Atoms seems to result in segfaults. Wrapping the basic information in an AtomShim seems to avoid issues.

GetAtomicNum() → int[source]

Returns atomic number for this atom.

Returns:Atomic number for this atom.
Return type:int
GetCoords() → numpy.ndarray[source]

Returns 3D coordinates for this atom as numpy array.

Returns:Numpy array of shape (3,) with coordinates for this atom.
Return type:np.ndarray
GetPartialCharge() → float[source]

Returns partial charge for this atom.

Returns:A partial Gasteiger charge for this atom.
Return type:float
__init__(atomic_num: int, partial_charge: float, atom_coords: numpy.ndarray)[source]

Initialize this object

Parameters:
  • atomic_num (int) – Atomic number for this atom.
  • partial_charge (float) – The partial Gasteiger charge for this atom
  • atom_coords (np.ndarray) – Of shape (3,) with the coordinates of this atom
deepchem.utils.fragment_utils.strip_hydrogens(coords: numpy.ndarray, mol: Union[Any, deepchem.utils.fragment_utils.MolecularFragment]) → Tuple[numpy.ndarray, deepchem.utils.fragment_utils.MolecularFragment][source]

Strip the hydrogens from input molecule

Parameters:
  • coords (np.ndarray) – The coords must be of shape (N, 3) and correspond to coordinates of mol.
  • mol (rdkit.Chem.rdchem.Mol or MolecularFragment) – The molecule to strip
Returns:

A tuple of (coords, mol_frag) where coords is a numpy array of coordinates with the hydrogen coordinates removed, and mol_frag is a MolecularFragment.

Return type:

Tuple[np.ndarray, MolecularFragment]

Notes

This function requires RDKit to be installed.

deepchem.utils.fragment_utils.merge_molecular_fragments(molecules: List[deepchem.utils.fragment_utils.MolecularFragment]) → Optional[deepchem.utils.fragment_utils.MolecularFragment][source]

Helper method to merge a list of molecular fragments.

Parameters:molecules (List[MolecularFragment]) – List of MolecularFragment objects.
Returns:Returns a merged MolecularFragment
Return type:Optional[MolecularFragment]
deepchem.utils.fragment_utils.get_contact_atom_indices(fragments: List[Tuple[numpy.ndarray, Any]], cutoff: float = 4.5) → List[List[int]][source]

Compute the atoms close to the contact region.

Molecular complexes can get very large. This can make it unwieldy to compute functions on them. To improve memory usage, it can be very useful to trim out atoms that aren’t close to contact regions. This function computes pairwise distances between all pairs of molecules in the molecular complex. If an atom is within cutoff distance of any atom on another molecule in the complex, it is regarded as a contact atom. Otherwise it is trimmed.

Parameters:
  • fragments (List[Tuple[np.ndarray, rdkit.Chem.rdchem.Mol]]) – As returned by rdkit_utils.load_complex, a list of tuples of (coords, mol) where coords is a (N_atoms, 3) array and mol is the rdkit molecule object.
  • cutoff (float, optional (default 4.5)) – The cutoff distance in angstroms.
Returns:

A list of length len(fragments). Each entry in this list is a list of atom indices from that molecule which should be kept, in sorted order.

Return type:

List[List[int]]
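
The contact test described above can be written out in a few lines. This is a pure-Python sketch under stated assumptions: contact_indices is a hypothetical helper that takes only the coordinate lists (not (coords, mol) tuples), and treating "within cutoff" as a strict less-than comparison is an assumption.

```python
import math
from typing import List, Sequence, Set, Tuple

Coords = Sequence[Tuple[float, float, float]]


def contact_indices(fragment_coords: Sequence[Coords],
                    cutoff: float = 4.5) -> List[List[int]]:
    """For each fragment, list the atoms within `cutoff` of another fragment."""
    keep: List[Set[int]] = [set() for _ in fragment_coords]
    for a in range(len(fragment_coords)):
        for b in range(a + 1, len(fragment_coords)):
            for i, p in enumerate(fragment_coords[a]):
                for j, q in enumerate(fragment_coords[b]):
                    # Assumption: the boundary itself does not count.
                    if math.dist(p, q) < cutoff:
                        keep[a].add(i)
                        keep[b].add(j)
    return [sorted(indices) for indices in keep]


protein = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
ligand = [(0.0, 0.0, 3.0), (0.0, 0.0, 20.0)]
print(contact_indices([protein, ligand]))  # [[0], [0]]
```

The quadruple loop is quadratic in the number of atoms; a vectorised pairwise-distance computation would be the natural choice for real complexes.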

deepchem.utils.fragment_utils.reduce_molecular_complex_to_contacts(fragments: List[Tuple[numpy.ndarray, Any]], cutoff: float = 4.5) → List[Tuple[numpy.ndarray, deepchem.utils.fragment_utils.MolecularFragment]][source]

Reduce a molecular complex to only those atoms near a contact.

Molecular complexes can get very large. This can make it unwieldy to compute functions on them. To improve memory usage, it can be very useful to trim out atoms that aren’t close to contact regions. This function takes in a molecular complex and returns a new molecular complex representation that contains only contact atoms. The contact atoms are computed by calling get_contact_atom_indices under the hood.

Parameters:
  • fragments (List[Tuple[np.ndarray, rdkit.Chem.rdchem.Mol]]) – As returned by rdkit_utils.load_complex, a list of tuples of (coords, mol) where coords is a (N_atoms, 3) array and mol is the rdkit molecule object.
  • cutoff (float) – The cutoff distance in angstroms.
Returns:

A list of length len(fragments). Each entry in this list is a tuple of (coords, MolecularFragment). The coords array is stripped down to (N_contact_atoms, 3) where N_contact_atoms is the number of contact atoms for this complex. MolecularFragment is used since it’s tricky to make an RDKit sub-molecule.

Return type:

List[Tuple[np.ndarray, MolecularFragment]]

Coordinate Box Utilities

class deepchem.utils.coordinate_box_utils.CoordinateBox(x_range: Tuple[float, float], y_range: Tuple[float, float], z_range: Tuple[float, float])[source]

A coordinate box that represents a block in space.

Molecular complexes are typically represented with atoms as coordinate points. Each complex is naturally associated with a number of different box regions. For example, the bounding box is a box that contains all atoms in the molecular complex. A binding pocket box is a box that focuses in on a binding region of a protein to a ligand. An interface box is the region in which two proteins have a bulk interaction.

The CoordinateBox class is designed to represent such regions of space. It consists of the coordinates of the box, and the collection of atoms that live in this box alongside their coordinates.

__init__(x_range: Tuple[float, float], y_range: Tuple[float, float], z_range: Tuple[float, float])[source]

Initialize this box.

Parameters:
  • x_range (Tuple[float, float]) – A tuple of (x_min, x_max) with the minimum and maximum x-coordinates.
  • y_range (Tuple[float, float]) – A tuple of (y_min, y_max) with the minimum and maximum y-coordinates.
  • z_range (Tuple[float, float]) – A tuple of (z_min, z_max) with the minimum and maximum z-coordinates.
Raises:

ValueError if this interval is malformed

center() → Tuple[float, float, float][source]

Computes the center of this box.

Returns:(x, y, z) the coordinates of the center of the box.
Return type:Tuple[float, float, float]

Examples

>>> from deepchem.utils.coordinate_box_utils import CoordinateBox
>>> box = CoordinateBox((0, 1), (0, 1), (0, 1))
>>> box.center()
(0.5, 0.5, 0.5)
contains(other: deepchem.utils.coordinate_box_utils.CoordinateBox) → bool[source]

Test whether this box contains another.

This method checks whether other is contained in this box.

Parameters:other (CoordinateBox) – The box to check is contained in this box.
Returns:True if other is contained in this box.
Return type:bool
Raises:ValueError if not isinstance(other, CoordinateBox).
volume() → float[source]

Computes and returns the volume of this box.

Returns:The volume of this box. Can be 0 if box is empty
Return type:float

Examples

>>> from deepchem.utils.coordinate_box_utils import CoordinateBox
>>> box = CoordinateBox((0, 1), (0, 1), (0, 1))
>>> box.volume()
1
deepchem.utils.coordinate_box_utils.intersect_interval(interval1: Tuple[float, float], interval2: Tuple[float, float]) → Tuple[float, float][source]

Computes the intersection of two intervals.

Parameters:
  • interval1 (Tuple[float, float]) – Should be (x1_min, x1_max)
  • interval2 (Tuple[float, float]) – Should be (x2_min, x2_max)
Returns:

x_intersect – The intersection of the two intervals. If the intersection is empty, returns (0, 0) to represent the empty set. Otherwise it is (max(x1_min, x2_min), min(x1_max, x2_max)).

Return type:

Tuple[float, float]
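
The rule stated in the return description translates directly into code. The following sketch follows that description; intersect is a hypothetical stand-in name, and treating intervals that share only an endpoint as non-empty is an assumption.

```python
from typing import Tuple

Interval = Tuple[float, float]


def intersect(interval1: Interval, interval2: Interval) -> Interval:
    """Intersect two closed intervals; (0, 0) stands for the empty set."""
    x1_min, x1_max = interval1
    x2_min, x2_max = interval2
    if x1_max < x2_min or x2_max < x1_min:
        # One interval lies entirely to the left of the other.
        return (0, 0)
    return (max(x1_min, x2_min), min(x1_max, x2_max))


print(intersect((0, 2), (1, 3)))  # (1, 2)
print(intersect((0, 1), (2, 3)))  # (0, 0)
```

Note that (0, 0) is also a valid degenerate interval, so callers cannot distinguish an empty intersection from a genuine single point at the origin.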

deepchem.utils.coordinate_box_utils.union(box1: deepchem.utils.coordinate_box_utils.CoordinateBox, box2: deepchem.utils.coordinate_box_utils.CoordinateBox) → deepchem.utils.coordinate_box_utils.CoordinateBox[source]

Merges provided boxes to find the smallest union box.

This method merges the two provided boxes.

Parameters:
  • box1 (CoordinateBox) – First box to merge.
  • box2 (CoordinateBox) – Second box to merge.
Returns:

Smallest CoordinateBox that contains both box1 and box2

Return type:

CoordinateBox
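
The smallest box containing two axis-aligned boxes takes, on each axis, the smaller of the two minima and the larger of the two maxima. The sketch below shows this with plain tuples standing in for CoordinateBox objects; union_box and the Box alias are hypothetical names used only for illustration.

```python
from typing import Tuple

Interval = Tuple[float, float]
Box = Tuple[Interval, Interval, Interval]  # (x_range, y_range, z_range)


def union_box(box1: Box, box2: Box) -> Box:
    """Smallest axis-aligned box that contains both inputs."""
    return tuple(
        (min(r1[0], r2[0]), max(r1[1], r2[1]))
        for r1, r2 in zip(box1, box2)
    )


box1 = ((0, 1), (0, 1), (0, 1))
box2 = ((0.5, 2), (-1, 0.5), (0, 3))
print(union_box(box1, box2))  # ((0, 2), (-1, 1), (0, 3))
```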

deepchem.utils.coordinate_box_utils.merge_overlapping_boxes(boxes: List[deepchem.utils.coordinate_box_utils.CoordinateBox], threshold: float = 0.8) → List[deepchem.utils.coordinate_box_utils.CoordinateBox][source]

Merge boxes which have an overlap greater than threshold.

Parameters:
  • boxes (list[CoordinateBox]) – A list of CoordinateBox objects.
  • threshold (float, default 0.8) – The volume fraction of the boxes that must overlap for them to be merged together.
Returns:

List[CoordinateBox] of merged boxes. This list will have length less than or equal to the length of boxes.

Return type:

List[CoordinateBox]

deepchem.utils.coordinate_box_utils.get_face_boxes(coords: numpy.ndarray, pad: float = 5.0) → List[deepchem.utils.coordinate_box_utils.CoordinateBox][source]

For each face of the convex hull, compute a coordinate box around it.

The convex hull of a macromolecule will have a series of triangular faces. For each such triangular face, we construct a bounding box around this triangle. Think of this box as attempting to capture some binding interaction region whose exterior is controlled by the box. Note that this box will likely be a crude approximation, but the advantage of this technique is that it only uses simple geometry to provide some basic biological insight into the molecule at hand.

The pad parameter is used to control the amount of padding around the face to be used for the coordinate box.

Parameters:
  • coords (np.ndarray) – A numpy array of shape (N, 3). The coordinates of a molecule.
  • pad (float, optional (default 5.0)) – The number of angstroms to pad.
Returns:

boxes – List of CoordinateBox

Return type:

List[CoordinateBox]

Examples

>>> import numpy as np
>>> from deepchem.utils.coordinate_box_utils import get_face_boxes
>>> coords = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
>>> boxes = get_face_boxes(coords, pad=5)
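
The per-face step is simple geometry: take the extent of the triangle on each axis and widen it by pad on both sides. The sketch below covers only that step for a single triangle and leaves out the convex hull computation; face_box is a hypothetical helper, and returning plain tuples instead of a CoordinateBox is a simplification.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]
Interval = Tuple[float, float]


def face_box(triangle: Sequence[Point],
             pad: float = 5.0) -> Tuple[Interval, ...]:
    """Padded bounding ranges (x, y, z) around one triangular face."""
    xs, ys, zs = zip(*triangle)
    return tuple((min(axis) - pad, max(axis) + pad) for axis in (xs, ys, zs))


triangle = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]
print(face_box(triangle, pad=5.0))
# ((-5.0, 6.0), (-5.0, 6.0), (-5.0, 5.0))
```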

Evaluation Utils

class deepchem.utils.evaluate.Evaluator(model, dataset: deepchem.data.datasets.Dataset, transformers: List[dc.trans.Transformer])[source]

Class that evaluates a model on a given dataset.

The evaluator class is used to evaluate a dc.models.Model class on a given dc.data.Dataset object. The evaluator is aware of dc.trans.Transformer objects so will automatically undo any transformations which have been applied.

Examples

Evaluators allow a model to be evaluated directly with a metric function from sklearn. Let’s do a bit of setup constructing our dataset and model.

>>> import deepchem as dc
>>> import numpy as np
>>> X = np.random.rand(10, 5)
>>> y = np.random.rand(10, 1)
>>> dataset = dc.data.NumpyDataset(X, y)
>>> model = dc.models.MultitaskRegressor(1, 5)
>>> transformers = []

Then you can evaluate this model as follows

>>> import sklearn
>>> from deepchem.utils.evaluate import Evaluator
>>> evaluator = Evaluator(model, dataset, transformers)
>>> multitask_scores = evaluator.compute_model_performance(
...     sklearn.metrics.mean_absolute_error)

Evaluators can also be used with dc.metrics.Metric objects in case you want to customize your metric further.

>>> evaluator = Evaluator(model, dataset, transformers)
>>> metric = dc.metrics.Metric(dc.metrics.mae_score)
>>> multitask_scores = evaluator.compute_model_performance(metric)
__init__(model, dataset: deepchem.data.datasets.Dataset, transformers: List[dc.trans.Transformer])[source]

Initialize this evaluator

Parameters:
  • model (Model) – Model to evaluate. Note that this must be a regression or classification model and not a generative model.
  • dataset (Dataset) – Dataset object to evaluate model on.
  • transformers (List[Transformer]) – List of dc.trans.Transformer objects. These transformations must have been applied to dataset previously. The dataset will be untransformed for metric evaluation.
compute_model_performance(metrics: Union[deepchem.metrics.metric.Metric, Callable[[...], Any], List[deepchem.metrics.metric.Metric], List[Callable[[...], Any]]], csv_out: Optional[str] = None, stats_out: Optional[str] = None, per_task_metrics: bool = False, use_sample_weights: bool = False, n_classes: int = 2) → Union[Dict[str, float], Tuple[Dict[str, float], Dict[str, float]]][source]

Computes statistics of model on test data and saves results to csv.

Parameters:
  • metrics (dc.metrics.Metric/list[dc.metrics.Metric]/function) – The set of metrics provided. This class attempts to do some intelligent handling of input. If a single dc.metrics.Metric object is provided or a list is provided, it will evaluate self.model on these metrics. If a function is provided, it is assumed to be a metric function that this method will attempt to wrap in a dc.metrics.Metric object. A metric function must accept two arguments, y_true, y_pred both of which are np.ndarray objects and return a floating point score. The metric function may also accept a keyword argument sample_weight to account for per-sample weights.
  • csv_out (str, optional (DEPRECATED)) – Filename to write CSV of model predictions.
  • stats_out (str, optional (DEPRECATED)) – Filename to write computed statistics.
  • per_task_metrics (bool, optional) – If true, return computed metric for each task on multitask dataset.
  • use_sample_weights (bool, optional (default False)) – If set, use per-sample weights w.
  • n_classes (int, optional (default 2)) – If specified, will use n_classes as the number of unique classes in self.dataset. Note that this argument will be ignored for regression metrics.
Returns:

  • multitask_scores (dict) – Dictionary mapping names of metrics to metric scores.
  • all_task_scores (dict, optional) – If per_task_metrics == True, then returns a second dictionary of scores for each task separately.
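
A plain function that satisfies the metric contract described above (two positional arguments y_true and y_pred, an optional sample_weight keyword, and a floating point result) might look like the sketch below. weighted_mae is a hypothetical example and uses plain sequences rather than np.ndarray to stay dependency-free.

```python
from typing import Optional, Sequence


def weighted_mae(y_true: Sequence[float], y_pred: Sequence[float],
                 sample_weight: Optional[Sequence[float]] = None) -> float:
    """Mean absolute error with optional per-sample weights."""
    if sample_weight is None:
        # Unweighted case: every sample counts equally.
        sample_weight = [1.0] * len(y_true)
    total = sum(w * abs(t - p)
                for t, p, w in zip(y_true, y_pred, sample_weight))
    return total / sum(sample_weight)


print(weighted_mae([1.0, 2.0, 4.0], [1.0, 3.0, 2.0]))                   # 1.0
print(weighted_mae([1.0, 2.0, 4.0], [1.0, 3.0, 2.0], [1.0, 1.0, 0.0]))  # 0.5
```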

output_predictions(y_preds: numpy.ndarray, csv_out: str)[source]

Writes predictions to file.

Writes predictions made on self.dataset to a specified file on disk. self.dataset.ids are used to format predictions.

Parameters:
  • y_preds (np.ndarray) – Predictions to output
  • csv_out (str) – Name of file to write predictions to.
output_statistics(scores: Dict[str, float], stats_out: str)[source]

Write computed stats to file.

Parameters:
  • scores (dict) – Dictionary mapping names of metrics to scores.
  • stats_out (str) – Name of file to write scores to.
class deepchem.utils.evaluate.GeneratorEvaluator(model, generator: Iterable[Tuple[Any, Any, Any]], transformers: List[dc.trans.Transformer], labels: Optional[List[T]] = None, weights: Optional[List[T]] = None)[source]

Evaluate models on a stream of data.

This class is a partner class to Evaluator. Instead of operating over datasets this class operates over a generator which yields batches of data to feed into provided model.

Examples

>>> import deepchem as dc
>>> import numpy as np
>>> X = np.random.rand(10, 5)
>>> y = np.random.rand(10, 1)
>>> dataset = dc.data.NumpyDataset(X, y)
>>> model = dc.models.MultitaskRegressor(1, 5)
>>> generator = model.default_generator(dataset, pad_batches=False)
>>> transformers = []

Then you can evaluate this model as follows

>>> import sklearn
>>> from deepchem.utils.evaluate import GeneratorEvaluator
>>> evaluator = GeneratorEvaluator(model, generator, transformers)
>>> multitask_scores = evaluator.compute_model_performance(
...     sklearn.metrics.mean_absolute_error)

Evaluators can also be used with dc.metrics.Metric objects in case you want to customize your metric further. (Note that a given generator can only be used once, so we have to redefine the generator here.)

>>> generator = model.default_generator(dataset, pad_batches=False)
>>> evaluator = GeneratorEvaluator(model, generator, transformers)
>>> metric = dc.metrics.Metric(dc.metrics.mae_score)
>>> multitask_scores = evaluator.compute_model_performance(metric)
__init__(model, generator: Iterable[Tuple[Any, Any, Any]], transformers: List[dc.trans.Transformer], labels: Optional[List[T]] = None, weights: Optional[List[T]] = None)[source]
Parameters:
  • model (Model) – Model to evaluate.
  • generator (generator) – Generator which yields batches to feed into the model. For a KerasModel, it should be a tuple of the form (inputs, labels, weights). The “correct” way to create this generator is to use model.default_generator as shown in the example above.
  • transformers (List[Transformer]) – Transformers to “undo” when applied to the model’s outputs
  • labels (list of Layer) – layers which are keys in the generator to compare to outputs
  • weights (list of Layer) – layers which are keys in the generator for weight matrices
compute_model_performance(metrics: Union[deepchem.metrics.metric.Metric, Callable[[...], Any], List[deepchem.metrics.metric.Metric], List[Callable[[...], Any]]], per_task_metrics: bool = False, use_sample_weights: bool = False, n_classes: int = 2) → Union[Dict[str, float], Tuple[Dict[str, float], Dict[str, float]]][source]

Computes statistics of the model on the data supplied by the generator.

Parameters:
  • metrics (dc.metrics.Metric/list[dc.metrics.Metric]/function) – The set of metrics provided. This class attempts to do some intelligent handling of input. If a single dc.metrics.Metric object is provided or a list is provided, it will evaluate self.model on these metrics. If a function is provided, it is assumed to be a metric function that this method will attempt to wrap in a dc.metrics.Metric object. A metric function must accept two arguments, y_true, y_pred both of which are np.ndarray objects and return a floating point score.
  • per_task_metrics (bool, optional) – If true, return computed metric for each task on multitask dataset.
  • use_sample_weights (bool, optional (default False)) – If set, use per-sample weights w.
  • n_classes (int, optional (default 2)) – If specified, will assume that all metrics are classification metrics and will use n_classes as the number of unique classes in self.dataset.
Returns:

  • multitask_scores (dict) – Dictionary mapping names of metrics to metric scores.
  • all_task_scores (dict, optional) – If per_task_metrics == True, then returns a second dictionary of scores for each task separately.
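
As a minimal sketch of the metric-function contract described under metrics, the function below takes y_true and y_pred as np.ndarray objects and returns a floating point score. The name max_abs_error is hypothetical and is not part of DeepChem; it is checked here on toy arrays only, without a model.

```python
import numpy as np


def max_abs_error(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Hypothetical metric: largest absolute deviation between labels and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.max(np.abs(y_true - y_pred)))


# Stand-alone check on toy arrays (no model or generator required).
score = max_abs_error(np.array([1.0, 2.0, 3.0]), np.array([1.5, 2.0, 2.0]))
```

Going by the parameter description, a function of this shape could presumably be passed as metrics, where it would be wrapped in a dc.metrics.Metric object.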

deepchem.utils.evaluate.relative_difference(x: numpy.ndarray, y: numpy.ndarray) → numpy.ndarray[source]

Compute the relative difference between x and y

The two argument arrays must have the same shape.

Parameters:
  • x (np.ndarray) – First input array
  • y (np.ndarray) – Second input array
Returns:

z – We will have z == np.abs(x-y) / np.abs(max(x, y)).

Return type:

np.ndarray
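
The formula above can be sketched in plain NumPy as follows. This is an illustration rather than the library's implementation: the name relative_difference_sketch is hypothetical, and it assumes that max(x, y) is taken elementwise.

```python
import numpy as np


def relative_difference_sketch(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Elementwise |x - y| / |max(x, y)| for two arrays of the same shape."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("x and y must have the same shape")
    # Assumption: the maximum is taken elementwise.
    return np.abs(x - y) / np.abs(np.maximum(x, y))


z = relative_difference_sketch(np.array([1.0, 2.0, 4.0]), np.array([2.0, 2.0, 1.0]))
```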

Genomic Utilities

deepchem.utils.genomics_utils.seq_one_hot_encode(sequences: Union[numpy.ndarray, Iterator[Iterable[str]]], letters: str = 'ATCGN') → numpy.ndarray[source]

One hot encodes list of genomic sequences.

Sequences encoded have shape (N_sequences, N_letters, sequence_length, 1). These sequences will be processed as images with one color channel.

Parameters:
  • sequences (np.ndarray or Iterator[Bio.SeqRecord]) – Iterable object of genetic sequences
  • letters (str, optional (default "ATCGN")) – String with the set of possible letters in the sequences.
Raises:

ValueError – If sequences are of different lengths.

Returns:

A numpy array of shape (N_sequences, N_letters, sequence_length, 1).

Return type:

np.ndarray
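
To make the output layout concrete, here is a small sketch that builds an array of shape (N_sequences, N_letters, sequence_length, 1) from plain Python strings. The function name one_hot_sequences_sketch is hypothetical, and the sketch handles only strings (not Bio.SeqRecord objects); it is not the library's implementation.

```python
import numpy as np


def one_hot_sequences_sketch(sequences, letters="ATCGN"):
    """One-hot encode equal-length strings into shape (N, len(letters), L, 1)."""
    sequences = [str(s) for s in sequences]
    if len({len(s) for s in sequences}) > 1:
        raise ValueError("All sequences must have the same length.")
    seq_len = len(sequences[0])
    # Row index of each letter along the N_letters axis.
    letter_index = {letter: i for i, letter in enumerate(letters)}
    encoded = np.zeros((len(sequences), len(letters), seq_len, 1))
    for n, seq in enumerate(sequences):
        for pos, letter in enumerate(seq):
            # A letter outside `letters` raises KeyError in this sketch.
            encoded[n, letter_index[letter], pos, 0] = 1.0
    return encoded


encoded = one_hot_sequences_sketch(["ACGT", "TTNA"])
```

Each sequence position has exactly one nonzero entry along the letters axis, which is why the result can be treated as a single-channel image.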

deepchem.utils.genomics_utils.encode_bio_sequence(fname: str, file_type: str = 'fasta', letters: str = 'ATCGN') → numpy.ndarray[source]

Loads a sequence file and returns an array of one-hot sequences.

Parameters:
  • fname (str) – Filename of fasta file.
  • file_type (str, optional (default "fasta")) – The type of file encoding to process, e.g. fasta or fastq. This is passed to Bio.SeqIO.parse.
  • letters (str, optional (default "ATCGN")) – The set of letters that the sequences consist of, e.g. ATCG.
Returns:

A numpy array of shape (N_sequences, N_letters, sequence_length, 1).

Return type:

np.ndarray

Notes

This function requires BioPython to be installed.

Geometry Utilities

deepchem.utils.geometry_utils.unit_vector(vector: numpy.ndarray) → numpy.ndarray[source]

Returns the unit vector of the vector.

Parameters:vector (np.ndarray) – A numpy array of shape (3,), where 3 is (x,y,z).
Returns:A numpy array of shape (3,). The unit vector of the input vector.
Return type:np.ndarray
deepchem.utils.geometry_utils.angle_between(vector_i: numpy.ndarray, vector_j: numpy.ndarray) → numpy.ndarray[source]

Returns the angle in radians between vectors “vector_i” and “vector_j”

Note that this function always returns the smaller of the two angles between the vectors (value between 0 and pi).

Parameters:
  • vector_i (np.ndarray) – A numpy array of shape (3,), where 3 is (x,y,z).
  • vector_j (np.ndarray) – A numpy array of shape (3,), where 3 is (x,y,z).
Returns:

The angle in radians between the two vectors.

Return type:

np.ndarray

Examples

>>> print("%0.06f" % angle_between((1, 0, 0), (0, 1, 0)))
1.570796
>>> print("%0.06f" % angle_between((1, 0, 0), (1, 0, 0)))
0.000000
>>> print("%0.06f" % angle_between((1, 0, 0), (-1, 0, 0)))
3.141593
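
One way to compute such an angle is to normalize both vectors and take the arccosine of their dot product. The sketch below uses hypothetical names (unit_vector_sketch, angle_between_sketch), and the clipping step is an added safeguard against round-off rather than something the documentation states.

```python
import numpy as np


def unit_vector_sketch(vector):
    """Scale a vector to length 1."""
    vector = np.asarray(vector, dtype=float)
    return vector / np.linalg.norm(vector)


def angle_between_sketch(vector_i, vector_j):
    """Angle in radians, in [0, pi], between two vectors."""
    cos_angle = np.dot(unit_vector_sketch(vector_i), unit_vector_sketch(vector_j))
    # Clip so that round-off cannot push the cosine outside [-1, 1].
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))


right_angle = angle_between_sketch((1, 0, 0), (0, 1, 0))
opposite = angle_between_sketch((1, 0, 0), (-1, 0, 0))
```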
deepchem.utils.geometry_utils.generate_random_unit_vector() → numpy.ndarray[source]

Generate a random unit vector on the sphere S^2.

Citation: http://mathworld.wolfram.com/SpherePointPicking.html

Pseudocode:
  1. Choose random theta in [0, 2*pi]
  2. Choose random z in [-1, 1]
  3. Compute output vector u: (x,y,z) = (sqrt(1-z^2)*cos(theta), sqrt(1-z^2)*sin(theta),z)
Returns:u – A numpy array of shape (3,). u is a unit vector
Return type:np.ndarray
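
The three pseudocode steps above translate directly into NumPy. This is a sketch, not the library's code: the function name random_unit_vector_sketch and the optional rng argument (a numpy.random.Generator) are assumptions added so the example is reproducible.

```python
import numpy as np


def random_unit_vector_sketch(rng=None):
    """Sample a point on the unit sphere by drawing theta and z uniformly."""
    rng = np.random.default_rng() if rng is None else rng
    theta = rng.uniform(0.0, 2.0 * np.pi)  # step 1
    z = rng.uniform(-1.0, 1.0)             # step 2
    r = np.sqrt(1.0 - z * z)               # radius of the circle at height z
    return np.array([r * np.cos(theta), r * np.sin(theta), z])  # step 3


u = random_unit_vector_sketch(np.random.default_rng(0))
```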
deepchem.utils.geometry_utils.generate_random_rotation_matrix() → numpy.ndarray[source]

Generates a random rotation matrix.

  1. Generate a random unit vector u, randomly sampled from the unit sphere (see function generate_random_unit_vector() for details)
  2. Generate a second random unit vector v
    a. If absolute value of u dot v > 0.99, repeat. (This is important for numerical stability. Intuition: we want them to be as linearly independent as possible or else the orthogonalized version of v will be much shorter in magnitude compared to u. I assume in Stack they took this from Gram-Schmidt orthogonalization?)
    b. v" = v - (u dot v)*u, i.e. subtract out the component of v that’s in u’s direction
    c. normalize v" (this isn’t in Stack but I assume it must be done)
  3. find w = u cross v"
  4. u, v", and w will form the columns of a rotation matrix, R. The intuition is that u, v" and w are, respectively, what the standard basis vectors e1, e2, and e3 will be mapped to under the transformation.
Returns:R – A numpy array of shape (3, 3). R is a rotation matrix.
Return type:np.ndarray
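
The procedure above can be sketched as follows. The names random_rotation_matrix_sketch and _random_unit_vector, and the rng argument, are hypothetical and chosen for illustration; the steps follow the list above, but this is not the library's implementation.

```python
import numpy as np


def _random_unit_vector(rng):
    """Uniform point on the unit sphere (theta and z drawn uniformly)."""
    theta = rng.uniform(0.0, 2.0 * np.pi)
    z = rng.uniform(-1.0, 1.0)
    r = np.sqrt(1.0 - z * z)
    return np.array([r * np.cos(theta), r * np.sin(theta), z])


def random_rotation_matrix_sketch(rng=None):
    """Build a rotation matrix whose columns are u, orthogonalized v, and u x v."""
    rng = np.random.default_rng() if rng is None else rng
    u = _random_unit_vector(rng)
    v = _random_unit_vector(rng)
    # Resample v while it is nearly parallel to u (numerical stability).
    while abs(np.dot(u, v)) > 0.99:
        v = _random_unit_vector(rng)
    # Gram-Schmidt step: remove the component of v along u, then normalize.
    v_orth = v - np.dot(u, v) * u
    v_orth /= np.linalg.norm(v_orth)
    w = np.cross(u, v_orth)
    return np.column_stack([u, v_orth, w])


R = random_rotation_matrix_sketch(np.random.default_rng(0))
```

Because the columns are orthonormal and w = u cross v", the result satisfies R Rᵀ = I with determinant +1, which is what distinguishes a rotation from a reflection.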
deepchem.utils.geometry_utils.is_angle_within_cutoff(vector_i: numpy.ndarray, vector_j: numpy.ndarray, angle_cutoff: float) → bool[source]

A utility function to compute whether two vectors are within a cutoff from 180 degrees apart.

Parameters:
  • vector_i (np.ndarray) – A numpy array of shape (3,), where 3 is (x,y,z).
  • vector_j (np.ndarray) – A numpy array of shape (3,), where 3 is (x,y,z).
  • angle_cutoff (float) – The deviation from 180 (in degrees)
Returns:

Whether two vectors are within a cutoff from 180 degrees apart

Return type:

bool
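
As an illustration of the check described above, the sketch below converts the angle between the two vectors to degrees and compares its deviation from 180 against the cutoff. The function name is hypothetical, and the use of a strict inequality is an assumption.

```python
import numpy as np


def is_angle_within_cutoff_sketch(vector_i, vector_j, angle_cutoff):
    """True if the angle between the vectors is within angle_cutoff degrees of 180."""
    vector_i = np.asarray(vector_i, dtype=float)
    vector_j = np.asarray(vector_j, dtype=float)
    cos_angle = np.dot(vector_i, vector_j) / (
        np.linalg.norm(vector_i) * np.linalg.norm(vector_j))
    # Clip to guard against round-off before taking the arccosine.
    angle_deg = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    # Assumption: strict comparison against the cutoff.
    return bool(abs(180.0 - angle_deg) < angle_cutoff)


antiparallel = is_angle_within_cutoff_sketch((1, 0, 0), (-1, 0, 0), 10.0)
perpendicular = is_angle_within_cutoff_sketch((1, 0, 0), (0, 1, 0), 10.0)
```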

Hash Function Utilities

deepchem.utils.hash_utils.hash_ecfp(ecfp: str, size: int = 1024) → int[source]

Returns an int < size representing given ECFP fragment.

Input must be a string. This utility function is used for various ECFP based fingerprints.

Parameters:
  • ecfp (str) – String to hash. Usually an ECFP fragment.
  • size (int, optional (default 1024)) – Hash to an int in range [0, size)
Returns:

ecfp_hash – An int < size representing given ECFP fragment

Return type:

int
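
A reproducible string-to-bucket hash of this kind can be sketched with the standard library. The choice of MD5 here is an assumption made for illustration, not a statement about what the library uses, and hash_fragment_sketch is a hypothetical name. A digest is used instead of Python's built-in hash() because the latter is salted per process for strings.

```python
import hashlib


def hash_fragment_sketch(fragment: str, size: int = 1024) -> int:
    """Map a string to an int in [0, size) using an MD5 digest (assumed choice)."""
    digest = hashlib.md5(fragment.encode("utf-8")).hexdigest()
    # Interpret the hex digest as an integer and fold it into [0, size).
    return int(digest, 16) % size


# "C(=O)N" is an arbitrary example string standing in for an ECFP fragment.
h = hash_fragment_sketch("C(=O)N", size=1024)
```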

deepchem.utils.hash_utils.hash_ecfp_pair(ecfp_pair: Tuple[str, str], size: int = 1024) → int[source]

Returns an int < size representing that ECFP pair.

Input must be a tuple of strings. This utility is primarily used for spatial contact featurizers. For example, if a protein and ligand have close contact region, the first string could be the protein’s fragment and the second the ligand’s fragment. The pair could be hashed together to achieve one hash value for this contact region.

Parameters:
  • ecfp_pair (Tuple[str, str]) – Pair of ECFP fragment strings
  • size (int, optional (default 1024)) – Hash to an int in range [0, size)
Returns:

ecfp_hash – An int < size representing given ECFP pair.

Return type:

int

deepchem.utils.hash_utils.vectorize(hash_function: Callable[[str, int], int], feature_dict: Optional[Dict[int, str]] = None, size: int = 1024) → numpy.ndarray[source]

Helper function to vectorize a spatial description from a hash.

Hash functions are used to perform spatial featurizations in DeepChem. However, it’s necessary to convert backwards from the hash function to feature vectors. This function aids in this conversion procedure. It creates a vector of zeros of length size. It then loops through feature_dict, uses hash_function to hash the stored value to an integer in range [0, size) and bumps that index.

Parameters:
  • hash_function (Function, Callable[[str, int], int]) – Should accept two arguments, feature, and size and return a hashed integer. Here feature is the item to hash, and size is an int. For example, if size=1024, then hashed values must fall in range [0, 1024).
  • feature_dict (Dict, optional (default None)) – Maps unique keys to features computed.
  • size (int, optional (default 1024)) – Length of generated bit vector
Returns:

feature_vector – A numpy array of shape (size,)

Return type:

np.ndarray
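
The loop described above can be sketched as follows. The names vectorize_sketch and toy_hash are hypothetical, and the sketch assumes that colliding features accumulate counts, which the description ("bumps that index") does not state explicitly.

```python
import numpy as np


def vectorize_sketch(hash_function, feature_dict=None, size=1024):
    """Build a length-`size` vector by hashing each stored feature to an index."""
    feature_vector = np.zeros(size)
    if feature_dict is not None:
        for feature in feature_dict.values():
            # Assumption: repeated indices accumulate rather than saturate at 1.
            feature_vector[hash_function(feature, size)] += 1
    return feature_vector


def toy_hash(feature, size):
    """Hypothetical stand-in hash: sum of character codes modulo size."""
    return sum(ord(c) for c in feature) % size


# "ab" and "ba" collide under toy_hash (index 3); "d" lands at index 4.
vec = vectorize_sketch(toy_hash, {0: "ab", 1: "ba", 2: "d"}, size=8)
```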

Voxel Utils

deepchem.utils.voxel_utils.convert_atom_to_voxel(coordinates: numpy.ndarray, atom_index: int, box_width: float, voxel_width: float) → numpy.ndarray[source]

Converts atom coordinates to an i,j,k grid index.

This function offsets molecular atom coordinates by (box_width/2, box_width/2, box_width/2) and then divides by voxel_width to compute the voxel indices.

Parameters:
  • coordinates (np.ndarray) – Array with coordinates of all atoms in the molecule, shape (N, 3).
  • atom_index (int) – Index of an atom in the molecule.
  • box_width (float) – Size of the box in Angstroms.
  • voxel_width (float) – Size of a voxel in Angstroms
Returns:

indices – A 1D numpy array of length 3 with [i, j, k], the voxel coordinates of specified atom.

Return type:

np.ndarray
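
The offset-and-divide computation described above can be sketched in a few lines. The function name atom_to_voxel_sketch is hypothetical, and rounding down with floor is an assumption; the description only says the shifted coordinates are divided by voxel_width.

```python
import numpy as np


def atom_to_voxel_sketch(coordinates, atom_index, box_width, voxel_width):
    """Shift one atom's coordinates by box_width/2 and bin them into voxels."""
    shifted = coordinates[atom_index] + box_width / 2.0
    # Assumption: indices are obtained by rounding down.
    return np.floor(shifted / voxel_width).astype(int)


coords = np.array([[0.0, 0.0, 0.0], [1.2, -3.7, 7.9]])
idx0 = atom_to_voxel_sketch(coords, 0, box_width=16.0, voxel_width=1.0)
idx1 = atom_to_voxel_sketch(coords, 1, box_width=16.0, voxel_width=1.0)
```

With a 16 Angstrom box and 1 Angstrom voxels, an atom at the origin lands in voxel [8, 8, 8], the center of the grid.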

deepchem.utils.voxel_utils.convert_atom_pair_to_voxel(coordinates_tuple: Tuple[numpy.ndarray, numpy.ndarray], atom_index_pair: Tuple[int, int], box_width: float, voxel_width: float) → numpy.ndarray[source]

Converts a pair of atoms to i,j,k grid indexes.

Parameters:
  • coordinates_tuple (Tuple[np.ndarray, np.ndarray]) – A tuple containing two molecular coordinate arrays of shapes (N, 3) and (M, 3).
  • atom_index_pair (Tuple[int, int]) – A tuple of indices for the atoms in the two molecules.
  • box_width (float) – Size of the box in Angstroms.
  • voxel_width (float) – Size of a voxel in Angstroms
Returns:

indices_list – A numpy array of shape (2, 3), where each row holds the [i, j, k] voxel coordinates of one of the specified atoms.

Return type:

np.ndarray

deepchem.utils.voxel_utils.voxelize(get_voxels: Callable[[...], Any], hash_function: Callable[[...], Any], coordinates: numpy.ndarray, box_width: float = 16.0, voxel_width: float = 1.0, feature_dict: Optional[Dict[Union[int, Tuple[int]], Any]] = None, feature_list: Optional[List[Union[int, Tuple[int]]]] = None, nb_channel: int = 16, dtype: str = 'int') → numpy.ndarray[source]

Helper function to voxelize inputs.

This helper function converts a hash function which specifies spatial features of a molecular complex into a voxel tensor. This utility is used by various featurizers that generate voxel grids.

Parameters:
  • get_voxels (Function) – Function that voxelizes inputs
  • hash_function (Function) – Used to map feature choices to voxel channels.
  • coordinates (np.ndarray) – Contains the 3D coordinates of a molecular system.
  • box_width (float, optional (default 16.0)) – Size of a box in which voxel features are calculated. Box is centered on a ligand centroid.
  • voxel_width (float, optional (default 1.0)) – Size of a 3D voxel in a grid in Angstroms.
  • feature_dict (Dict, optional (default None)) – Keys are atom indices or tuples of atom indices, the values are computed features. If hash_function is not None, then the values are hashed using the hash function into [0, nb_channel) and this channel at the voxel for the given key is incremented by 1 for each dictionary entry. If hash_function is None, then the value must be a vector of size (nb_channel,) which is added to the existing channel values at that voxel grid.
  • feature_list (List, optional (default None)) – List of atom indices or tuples of atom indices. This can only be used if nb_channel==1. Increments the voxels corresponding to these indices by 1 for each entry.
  • nb_channel (int, optional (default 16)) – The number of feature channels computed per voxel. Should be a power of 2.
  • dtype (str ('int' or 'float'), optional (default 'int')) – The type of the numpy ndarray created to hold features.
Returns:

feature_tensor – The voxel of the input with the shape (voxels_per_edge, voxels_per_edge, voxels_per_edge, nb_channel).

Return type:

np.ndarray
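
The following is a deliberately simplified sketch of the hashed feature_dict case only, with single atom indices as keys. It does not reproduce the real function's signature: voxelize_sketch and toy_hash are hypothetical names, the voxel lookup is inlined instead of passed as get_voxels, and skipping atoms that fall outside the box is an assumption.

```python
import numpy as np


def toy_hash(feature, size):
    """Hypothetical stand-in hash: sum of character codes modulo size."""
    return sum(ord(c) for c in feature) % size


def voxelize_sketch(coordinates, feature_dict, hash_function,
                    box_width=16.0, voxel_width=1.0, nb_channel=16):
    """Accumulate hashed per-atom features into a 4D voxel tensor."""
    voxels_per_edge = int(round(box_width / voxel_width))
    tensor = np.zeros((voxels_per_edge,) * 3 + (nb_channel,), dtype=int)
    for atom_index, feature in feature_dict.items():
        shifted = coordinates[atom_index] + box_width / 2.0
        i, j, k = np.floor(shifted / voxel_width).astype(int)
        # Assumption: atoms that fall outside the box are ignored.
        if not all(0 <= a < voxels_per_edge for a in (i, j, k)):
            continue
        # Hash the feature to a channel and increment that channel by 1.
        tensor[i, j, k, hash_function(feature, nb_channel)] += 1
    return tensor


coords = np.array([[0.0, 0.0, 0.0], [100.0, 0.0, 0.0]])
grid = voxelize_sketch(coords, {0: "a", 1: "b"}, toy_hash)
```

Here the first atom lands in the central voxel and the second lies far outside the 16 Angstrom box, so only one entry of the tensor is nonzero.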

Graph Convolution Utilities

deepchem.utils.molecule_feature_utils.one_hot_encode(val: Union[int, str], allowable_set: Union[List[str], List[int]], include_unknown_set: bool = False) → List[float][source]

One hot encoder for elements of a provided set.

Examples

>>> one_hot_encode("a", ["a", "b", "c"])
[1.0, 0.0, 0.0]
>>> one_hot_encode(2, [0, 1, 2])
[0.0, 0.0, 1.0]
>>> one_hot_encode(3, [0, 1, 2])
[0.0, 0.0, 0.0]
>>> one_hot_encode(3, [0, 1, 2], True)
[0.0, 0.0, 0.0, 1.0]
Parameters:
  • val (int or str) – The value must be present in allowable_set.
  • allowable_set (List[int] or List[str]) – List of allowable quantities.
  • include_unknown_set (bool, default False) – If true, the index of all values not in allowable_set is len(allowable_set).
Returns:

A one-hot vector of val. If include_unknown_set is False, the length is len(allowable_set). If include_unknown_set is True, the length is len(allowable_set) + 1.

Return type:

List[float]

Raises:

ValueError – If include_unknown_set is False and val is not in allowable_set.

deepchem.utils.molecule_feature_utils.get_atom_type_one_hot(atom: Any, allowable_set: List[str] = ['C', 'N', 'O', 'F', 'P', 'S', 'Cl', 'Br', 'I'], include_unknown_set: bool = True) → List[float][source]

Get a one-hot feature of an atom type.

Parameters:
  • atom (rdkit.Chem.rdchem.Atom) – RDKit atom object
  • allowable_set (List[str]) – The atom types to consider. The default set is [“C”, “N”, “O”, “F”, “P”, “S”, “Cl”, “Br”, “I”].
  • include_unknown_set (bool, default True) – If true, the index of all atoms not in allowable_set is len(allowable_set).
Returns:

A one-hot vector of atom types. If include_unknown_set is False, the length is len(allowable_set). If include_unknown_set is True, the length is len(allowable_set) + 1.

Return type:

List[float]

deepchem.utils.molecule_feature_utils.construct_hydrogen_bonding_info(mol: Any) → List[Tuple[int, str]][source]

Construct hydrogen bonding information for a molecule.

Parameters:mol (rdkit.Chem.rdchem.Mol) – RDKit mol object
Returns:A list of tuples (atom_index, hydrogen_bonding_type). The hydrogen_bonding_type value is “Acceptor” or “Donor”.
Return type:List[Tuple[int, str]]
deepchem.utils.molecule_feature_utils.get_atom_hydrogen_bonding_one_hot(atom: Any, hydrogen_bonding: List[Tuple[int, str]]) → List[float][source]

Get a one-hot feature about whether an atom accepts electrons or donates electrons.

Parameters:
  • atom (rdkit.Chem.rdchem.Atom) – RDKit atom object
  • hydrogen_bonding (List[Tuple[int, str]]) – The return value of construct_hydrogen_bonding_info. The value is a list of tuples (atom_index, hydrogen_bonding) like (1, “Acceptor”).
Returns:

A one-hot vector of the hydrogen bonding type. The first element indicates “Donor”, and the second element indicates “Acceptor”.

Return type:

List[float]

deepchem.utils.molecule_feature_utils.get_atom_is_in_aromatic_one_hot(atom: Any) → List[float][source]

Get a one-hot feature about whether an atom is in an aromatic system or not.

Parameters:atom (rdkit.Chem.rdchem.Atom) – RDKit atom object
Returns:A vector of whether an atom is in an aromatic system or not.
Return type:List[float]
deepchem.utils.molecule_feature_utils.get_atom_hybridization_one_hot(atom: Any, allowable_set: List[str] = ['SP', 'SP2', 'SP3'], include_unknown_set: bool = False) → List[float][source]

Get a one-hot feature of hybridization type.

Parameters:
  • atom (rdkit.Chem.rdchem.Atom) – RDKit atom object
  • allowable_set (List[str]) – The hybridization types to consider. The default set is [“SP”, “SP2”, “SP3”]
  • include_unknown_set (bool, default False) – If true, the index of all types not in allowable_set is len(allowable_set).
Returns:

A one-hot vector of the hybridization type. If include_unknown_set is False, the length is len(allowable_set). If include_unknown_set is True, the length is len(allowable_set) + 1.

Return type:

List[float]

deepchem.utils.molecule_feature_utils.get_atom_total_num_Hs_one_hot(atom: Any, allowable_set: List[int] = [0, 1, 2, 3, 4], include_unknown_set: bool = True) → List[float][source]

Get a one-hot feature of the number of hydrogens which an atom has.

Parameters:
  • atom (rdkit.Chem.rdchem.Atom) – RDKit atom object
  • allowable_set (List[int]) – The number of hydrogens to consider. The default set is [0, 1, …, 4]
  • include_unknown_set (bool, default True) – If true, the index of all types not in allowable_set is len(allowable_set).
Returns:

A one-hot vector of the number of hydrogens which an atom has. If include_unknown_set is False, the length is len(allowable_set). If include_unknown_set is True, the length is len(allowable_set) + 1.

Return type:

List[float]

deepchem.utils.molecule_feature_utils.get_atom_chirality_one_hot(atom: Any) → List[float][source]

Get a one-hot feature about an atom's chirality type.

Parameters:atom (rdkit.Chem.rdchem.Atom) – RDKit atom object
Returns:A one-hot vector of the chirality type. The first element indicates “R”, and the second element indicates “S”.
Return type:List[float]
deepchem.utils.molecule_feature_utils.get_atom_formal_charge(atom: Any) → List[float][source]

Get the formal charge of an atom.

Parameters:atom (rdkit.Chem.rdchem.Atom) – RDKit atom object
Returns:A vector of the formal charge.
Return type:List[float]
deepchem.utils.molecule_feature_utils.get_atom_partial_charge(atom: Any) → List[float][source]

Get the partial charge of an atom.

Parameters:atom (rdkit.Chem.rdchem.Atom) – RDKit atom object
Returns:A vector of the partial charge.
Return type:List[float]

Notes

Before using this function, you must compute Gasteiger charges, e.g. with AllChem.ComputeGasteigerCharges(mol).

deepchem.utils.molecule_feature_utils.get_atom_total_degree_one_hot(atom: Any, allowable_set: List[int] = [0, 1, 2, 3, 4, 5], include_unknown_set: bool = True) → List[float][source]

Get a one-hot feature of the degree which an atom has.

Parameters:
  • atom (rdkit.Chem.rdchem.Atom) – RDKit atom object
  • allowable_set (List[int]) – The degree to consider. The default set is [0, 1, …, 5]
  • include_unknown_set (bool, default True) – If true, the index of all types not in allowable_set is len(allowable_set).
Returns:

A one-hot vector of the degree which an atom has. If include_unknown_set is False, the length is len(allowable_set). If include_unknown_set is True, the length is len(allowable_set) + 1.

Return type:

List[float]

deepchem.utils.molecule_feature_utils.get_bond_type_one_hot(bond: Any, allowable_set: List[str] = ['SINGLE', 'DOUBLE', 'TRIPLE', 'AROMATIC'], include_unknown_set: bool = False) → List[float][source]

Get a one-hot feature of bond type.

Parameters:
  • bond (rdkit.Chem.rdchem.Bond) – RDKit bond object
  • allowable_set (List[str]) – The bond types to consider. The default set is [“SINGLE”, “DOUBLE”, “TRIPLE”, “AROMATIC”].
  • include_unknown_set (bool, default False) – If true, the index of all types not in allowable_set is len(allowable_set).
Returns:

A one-hot vector of the bond type. If include_unknown_set is False, the length is len(allowable_set). If include_unknown_set is True, the length is len(allowable_set) + 1.

Return type:

List[float]

deepchem.utils.molecule_feature_utils.get_bond_is_in_same_ring_one_hot(bond: Any) → List[float][source]

Get a one-hot feature about whether the atoms of a bond are in the same ring or not.

Parameters:bond (rdkit.Chem.rdchem.Bond) – RDKit bond object
Returns:A one-hot vector of whether a bond is in the same ring or not.
Return type:List[float]
deepchem.utils.molecule_feature_utils.get_bond_is_conjugated_one_hot(bond: Any) → List[float][source]

Get a one-hot feature about whether a bond is conjugated or not.

Parameters:bond (rdkit.Chem.rdchem.Bond) – RDKit bond object
Returns:A one-hot vector of whether a bond is conjugated or not.
Return type:List[float]
deepchem.utils.molecule_feature_utils.get_bond_stereo_one_hot(bond: Any, allowable_set: List[str] = ['STEREONONE', 'STEREOANY', 'STEREOZ', 'STEREOE'], include_unknown_set: bool = True) → List[float][source]

Get a one-hot feature of the stereo configuration of a bond.

Parameters:
  • bond (rdkit.Chem.rdchem.Bond) – RDKit bond object
  • allowable_set (List[str]) – The stereo configuration types to consider. The default set is [“STEREONONE”, “STEREOANY”, “STEREOZ”, “STEREOE”].
  • include_unknown_set (bool, default True) – If true, the index of all types not in allowable_set is len(allowable_set).
Returns:

A one-hot vector of the stereo configuration of a bond. If include_unknown_set is False, the length is len(allowable_set). If include_unknown_set is True, the length is len(allowable_set) + 1.

Return type:

List[float]

deepchem.utils.molecule_feature_utils.get_bond_graph_distance_one_hot(bond: Any, graph_dist_matrix: numpy.ndarray, allowable_set: List[int] = [1, 2, 3, 4, 5, 6, 7], include_unknown_set: bool = True) → List[float][source]

Get a one-hot feature of graph distance.

Parameters:
  • bond (rdkit.Chem.rdchem.Bond) – RDKit bond object
  • graph_dist_matrix (np.ndarray) – The return value of Chem.GetDistanceMatrix(mol). The shape is (num_atoms, num_atoms).
  • allowable_set (List[int]) – The graph distance types to consider. The default set is [1, 2, …, 7].
  • include_unknown_set (bool, default True) – If true, the index of all types not in allowable_set is len(allowable_set).
Returns:

A one-hot vector of the graph distance. If include_unknown_set is False, the length is len(allowable_set). If include_unknown_set is True, the length is len(allowable_set) + 1.

Return type:

List[float]

Debug Utilities