Utilities

DeepChem has a broad collection of utility functions. Many of these may be of independent interest to users, since they deal with some tricky aspects of processing scientific datatypes.

Array Utilities

deepchem.utils.pad_array(x, shape, fill=0, both=False)[source]

Pad an array with a fill value.

Parameters:
  • x (ndarray) – Matrix.
  • shape (tuple or int) – Desired shape. If int, all dimensions are padded to that size.
  • fill (object, optional (default 0)) – Fill value.
  • both (bool, optional (default False)) – If True, split the padding on both sides of each axis. If False, padding is applied to the end of each axis.
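The effect of both can be illustrated by computing the amount of padding on each side of every axis. The following is a minimal sketch, not DeepChem's implementation: pad_widths is a hypothetical helper, and the way an odd amount of padding is divided between the two sides is an assumption.

```python
def pad_widths(current_shape, target_shape, both=False):
    """Return a (before, after) padding pair for each axis."""
    # An int target means every axis is padded to the same size.
    if isinstance(target_shape, int):
        target_shape = (target_shape,) * len(current_shape)
    widths = []
    for have, want in zip(current_shape, target_shape):
        diff = want - have
        if diff < 0:
            raise ValueError("target shape is smaller than the array")
        if both:
            # Assumption: when diff is odd, the extra element goes at the end.
            before = diff // 2
            widths.append((before, diff - before))
        else:
            # Pad only at the end of the axis.
            widths.append((0, diff))
    return widths


print(pad_widths((2, 3), 5))             # [(0, 3), (0, 2)]
print(pad_widths((2, 3), 5, both=True))  # [(1, 2), (1, 1)]
```

Pairs in this form can be passed to numpy.pad(x, widths, mode="constant", constant_values=fill) to produce the padded array.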

Data Directory

The DeepChem data directory is where downloaded MoleculeNet datasets are stored.

deepchem.utils.get_data_dir()[source]

Get the DeepChem data directory.

URL Handling

deepchem.utils.download_url(url, dest_dir='/tmp', name=None)[source]

Download a file to disk.

Parameters:
  • url (str) – the URL to download from
  • dest_dir (str) – the directory to save the file in
  • name (str) – the file name to save it as. If omitted, it will try to extract a file name from the URL
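When name is omitted, a file name has to be derived from the URL. One way to do this with the standard library is sketched below; filename_from_url is a hypothetical helper and the URL is a made-up example, so DeepChem's own extraction may differ in detail.

```python
import os
from urllib.parse import urlparse


def filename_from_url(url):
    """Return the last path component of a URL, ignoring query and fragment."""
    path = urlparse(url).path
    return os.path.basename(path)


print(filename_from_url("https://example.com/datasets/tox21.csv.gz?download=1"))
# tox21.csv.gz
```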

File Handling

deepchem.utils.untargz_file(file, dest_dir='/tmp', name=None)[source]

Untar and unzip a .tar.gz file to disk.

Parameters:
  • file (str) – the filepath to decompress
  • dest_dir (str) – the directory to save the file in
  • name (str) – the file name to save it as. If omitted, it will use the file name
deepchem.utils.unzip_file(file, dest_dir=None, name=None)[source]

Unzip a .zip file to disk.

Parameters:
  • file (str) – the filepath to decompress
  • dest_dir (str) – the directory to save the file in
  • name (str) – the directory name to unzip it to. If omitted, it will use the file name
deepchem.utils.save.save_to_disk(dataset, filename, compress=3)[source]

Save a dataset to file.

deepchem.utils.save.get_input_type(input_file)[source]

Get type of input file. Must be csv/pkl.gz/sdf file.

deepchem.utils.save.load_data(input_files, shard_size=None, verbose=True)[source]

Loads data from disk.

For CSV files, supports sharded loading for large files.

deepchem.utils.save.load_sharded_csv(filenames)[source]

Load a dataset from multiple files. Each file MUST have the same column headers.

deepchem.utils.save.load_sdf_files(input_files, clean_mols, tasks=[])[source]

Load SDF file into dataframe.

deepchem.utils.save.load_csv_files(filenames, shard_size=None, verbose=True)[source]

Load data as pandas dataframe.

deepchem.utils.save.save_metadata(tasks, metadata_df, data_dir)[source]

Saves the metadata for a DiskDataset.

Parameters:
  • tasks (list of str) – Tasks of DiskDataset
  • metadata_df (pd.DataFrame) –
  • data_dir (str) – Directory to store metadata

deepchem.utils.save.load_from_disk(filename)[source]

Load a dataset from file.

deepchem.utils.save.load_pickle_from_disk(filename)[source]

Load dataset from pickle file.

deepchem.utils.save.load_dataset_from_disk(save_dir)[source]
Parameters:save_dir (str) –
Returns:
  • loaded (bool) – Whether the load succeeded
  • all_dataset ((dc.data.Dataset, dc.data.Dataset, dc.data.Dataset)) – The train, valid, test datasets
  • transformers (list of dc.trans.Transformer) – The transformers used for this dataset
deepchem.utils.save.save_dataset_to_disk(save_dir, train, valid, test, transformers)[source]

Molecular Utilities

class deepchem.utils.ScaffoldGenerator(include_chirality=False)[source]

Generate molecular scaffolds.

Parameters:include_chirality (bool, optional (default False)) – Include chirality in scaffolds.
__init__(include_chirality=False)[source]

Initialize self. See help(type(self)) for accurate signature.

get_scaffold(mol)[source]

Get Murcko scaffolds for molecules.

Murcko scaffolds are described in DOI: 10.1021/jm9602928. They are essentially that part of the molecule consisting of rings and the linker atoms between them.

Parameters:mol (RDKit Mol) – Molecule.
class deepchem.utils.conformers.ConformerGenerator(max_conformers=1, rmsd_threshold=0.5, force_field='uff', pool_multiplier=10)[source]

Generate molecule conformers.

Notes

Procedure:
  1. Generate a pool of conformers.
  2. Minimize conformers.
  3. Prune conformers using an RMSD threshold.

Note that pruning is done _after_ minimization, which differs from the protocol described in the references [1] [2].

References

[1] http://rdkit.org/docs/GettingStartedInPython.html#working-with-3d-molecules
[2] http://pubs.acs.org/doi/full/10.1021/ci2004658
Parameters:
  • max_conformers (int, optional (default 1)) – Maximum number of conformers to generate (after pruning).
  • rmsd_threshold (float, optional (default 0.5)) – RMSD threshold for pruning conformers. If None or negative, no pruning is performed.
  • force_field (str, optional (default 'uff')) – Force field to use for conformer energy calculation and minimization. Options are ‘uff’, ‘mmff94’, and ‘mmff94s’.
  • pool_multiplier (int, optional (default 10)) – Factor to multiply by max_conformers to generate the initial conformer pool. Since conformers are pruned after energy minimization, increasing the size of the pool increases the chance of identifying max_conformers unique conformers.
__init__(max_conformers=1, rmsd_threshold=0.5, force_field='uff', pool_multiplier=10)[source]

Initialize self. See help(type(self)) for accurate signature.

embed_molecule(mol)[source]

Generate conformers, possibly with pruning.

Parameters:mol (RDKit Mol) – Molecule.
generate_conformers(mol)[source]

Generate conformers for a molecule.

This function returns a copy of the original molecule with embedded conformers.

Parameters:mol (RDKit Mol) – Molecule.
get_conformer_energies(mol)[source]

Calculate conformer energies.

Parameters:mol (RDKit Mol) – Molecule.
Returns:energies – Minimized conformer energies.
Return type:array_like
static get_conformer_rmsd(mol)[source]

Calculate conformer-conformer RMSD.

Parameters:mol (RDKit Mol) – Molecule.
get_molecule_force_field(mol, conf_id=None, **kwargs)[source]

Get a force field for a molecule.

Parameters:
  • mol (RDKit Mol) – Molecule.
  • conf_id (int, optional) – ID of the conformer to associate with the force field.
  • kwargs (dict, optional) – Keyword arguments for force field constructor.
minimize_conformers(mol)[source]

Minimize molecule conformers.

Parameters:mol (RDKit Mol) – Molecule.
prune_conformers(mol)[source]

Prune conformers from a molecule using an RMSD threshold, starting with the lowest energy conformer.

Parameters:mol (RDKit Mol) – Molecule.
Returns:
  • A new RDKit Mol containing the chosen conformers, sorted by increasing energy.
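The pruning step can be illustrated without RDKit by working on precomputed energies and a conformer-conformer RMSD matrix. This is a sketch of the selection logic only: prune_by_rmsd is a hypothetical helper, the input values are invented, and it assumes a non-negative threshold and that a conformer is kept when its RMSD to every already kept conformer is at least the threshold.

```python
def prune_by_rmsd(energies, rmsd, threshold=0.5, max_conformers=1):
    """Return indices of the kept conformers, sorted by increasing energy."""
    # Visit conformers from lowest to highest energy.
    order = sorted(range(len(energies)), key=lambda i: energies[i])
    keep = []
    for i in order:
        if len(keep) >= max_conformers:
            break
        # Keep a conformer only if it is far enough from every kept one.
        if all(rmsd[i][j] >= threshold for j in keep):
            keep.append(i)
    return keep


energies = [3.0, 1.0, 2.0]
rmsd = [[0.0, 0.2, 0.9],
        [0.2, 0.0, 0.8],
        [0.9, 0.8, 0.0]]
print(prune_by_rmsd(energies, rmsd, threshold=0.5, max_conformers=3))  # [1, 2]
```

Conformer 0 is dropped because it lies within 0.2 of the lower-energy conformer 1.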
class deepchem.utils.rdkit_util.MoleculeLoadException(*args, **kwargs)[source]
__init__(*args, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

deepchem.utils.rdkit_util.get_xyz_from_mol(mol)[source]

Extracts a numpy array of coordinates from a molecule.

Returns a (N, 3) numpy array of 3d coords of given rdkit molecule

Parameters:mol (rdkit Molecule) – Molecule to extract coordinates for
Returns:
Return type:Numpy ndarray of shape (N, 3) where N = mol.GetNumAtoms().
deepchem.utils.rdkit_util.add_hydrogens_to_mol(mol, is_protein=False)[source]

Add hydrogens to a molecule object

Parameters:
  • mol (Rdkit Mol) – Molecule to hydrogenate
  • is_protein (bool, optional (default False)) – Whether this molecule is a protein.
Returns:

Return type:

Rdkit Mol

Note

This function requires RDKit and PDBFixer to be installed.

deepchem.utils.rdkit_util.compute_charges(mol)[source]

Attempt to compute Gasteiger Charges on Mol

This also has the side effect of calculating charges on mol. The mol passed into this function has to already have been sanitized

Parameters:mol (rdkit molecule) –
Returns:
Return type:No return since updates in place.

Note

This function requires RDKit to be installed.

deepchem.utils.rdkit_util.load_molecule(molecule_file, add_hydrogens=True, calc_charges=True, sanitize=True, is_protein=False)[source]

Converts molecule file to (xyz-coords, rdkit mol object)

Given molecule_file, returns a tuple of xyz coords of molecule and an rdkit object representing that molecule in that order (xyz, rdkit_mol). This ordering convention is used in the code in a few places.

Parameters:
  • molecule_file (str) – filename for molecule
  • add_hydrogens (bool, optional (default True)) – If True, add hydrogens via pdbfixer
  • calc_charges (bool, optional (default True)) – If True, add charges via rdkit
  • sanitize (bool, optional (default True)) – If True, sanitize molecules via rdkit
  • is_protein (bool, optional (default False)) – If True, this molecule is loaded as a protein. This flag will affect some of the cleanup procedures applied.
Returns:

  • Tuple (xyz, mol) if file contains single molecule. Else returns a list of the tuples for the separate molecules in this list.

Note

This function requires RDKit to be installed.

deepchem.utils.rdkit_util.write_molecule(mol, outfile, is_protein=False)[source]

Write molecule to a file

This function writes a representation of the provided molecule to the specified outfile. Doesn’t return anything.

Parameters:
  • mol (rdkit Mol) – Molecule to write
  • outfile (str) – Filename to write mol to
  • is_protein (bool, optional) – Is this molecule a protein?

Note

This function requires RDKit to be installed.

Raises:ValueError: if outfile isn’t of a supported format.

Coordinate Box Utilities

class deepchem.utils.coordinate_box_utils.CoordinateBox(x_range, y_range, z_range)[source]

A coordinate box that represents a block in space.

Molecular complexes are typically represented with atoms as coordinate points. Each complex is naturally associated with a number of different box regions. For example, the bounding box is a box that contains all atoms in the molecular complex. A binding pocket box is a box that focuses in on a binding region of a protein to a ligand. An interface box is the region in which two proteins have a bulk interaction.

The CoordinateBox class is designed to represent such regions of space. It consists of the coordinates of the box, and the collection of atoms that live in this box alongside their coordinates.

__init__(x_range, y_range, z_range)[source]

Initialize this box.

Parameters:
  • x_range (tuple) – A tuple of (x_min, x_max) with min and max x-coordinates.
  • y_range (tuple) – A tuple of (y_min, y_max) with min and max y-coordinates.
  • z_range (tuple) – A tuple of (z_min, z_max) with min and max z-coordinates.
Raises:

ValueError if this interval is malformed

center()[source]

Computes the center of this box.

Returns:
Return type:(x, y, z) the coordinates of the center of the box.

Examples

>>> box = CoordinateBox((0, 1), (0, 1), (0, 1))
>>> box.center()
(0.5, 0.5, 0.5)
contains(other)[source]

Test whether this box contains another.

This method checks whether other is contained in this box.

Parameters:other (CoordinateBox) – The box to check is contained in this box.
Returns:
Return type:bool, True if other is contained in this box.
Raises:ValueError if not isinstance(other, CoordinateBox).
volume()[source]

Computes and returns the volume of this box.

Returns:
Return type:float, the volume of this box. Can be 0 if box is empty

Examples

>>> box = CoordinateBox((0, 1), (0, 1), (0, 1))
>>> box.volume()
1
deepchem.utils.coordinate_box_utils.intersect_interval(interval1, interval2)[source]

Computes the intersection of two intervals.

Parameters:
  • interval1 (tuple[int]) – Should be (x1_min, x1_max)
  • interval2 (tuple[int]) – Should be (x2_min, x2_max)
Returns:

x_intersect – Should be the intersection. If the intersection is empty returns (0, 0) to represent the empty set. Otherwise is (max(x1_min, x2_min), min(x1_max, x2_max)).

Return type:

tuple[int]
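The return rule above can be written out directly. This is an illustrative re-implementation rather than the library source; intersect_interval_sketch is a hypothetical name, and treating intervals that only touch at an endpoint as a zero-length intersection is an assumption.

```python
def intersect_interval_sketch(interval1, interval2):
    """Intersect two closed intervals given as (min, max) tuples."""
    x1_min, x1_max = interval1
    x2_min, x2_max = interval2
    # Disjoint intervals: return (0, 0) to represent the empty set.
    if x1_max < x2_min or x2_max < x1_min:
        return (0, 0)
    return (max(x1_min, x2_min), min(x1_max, x2_max))


print(intersect_interval_sketch((0, 2), (1, 3)))  # (1, 2)
print(intersect_interval_sketch((0, 1), (2, 3)))  # (0, 0)
```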

deepchem.utils.coordinate_box_utils.union(box1, box2)[source]

Merges provided boxes to find the smallest union box.

This method merges the two provided boxes.

Parameters:
  • box1 (CoordinateBox) – First box to merge in
  • box2 (CoordinateBox) – Second box to merge into this box
Returns:

Return type:

Smallest CoordinateBox that contains both box1 and box2
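The smallest box containing two boxes takes the lowest minimum and the highest maximum along each axis. The sketch below shows this on plain tuples of ranges instead of CoordinateBox objects; union_ranges is a hypothetical helper and the values are made up.

```python
def union_ranges(box1, box2):
    """Each box is ((x_min, x_max), (y_min, y_max), (z_min, z_max))."""
    return tuple(
        (min(a_min, b_min), max(a_max, b_max))
        for (a_min, a_max), (b_min, b_max) in zip(box1, box2)
    )


box1 = ((0, 1), (0, 1), (0, 1))
box2 = ((2, 3), (-1, 0.5), (0, 4))
print(union_ranges(box1, box2))  # ((0, 3), (-1, 1), (0, 4))
```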

deepchem.utils.coordinate_box_utils.merge_overlapping_boxes(boxes, threshold=0.8)[source]

Merge boxes which have an overlap greater than threshold.

Parameters:
  • boxes (list[CoordinateBox]) – A list of CoordinateBox objects.
  • threshold (float, optional (default 0.8)) – The volume fraction of the boxes that must overlap for them to be merged together.
Returns:

  • list[CoordinateBox] of merged boxes. This list will have length less than or equal to the length of boxes.

deepchem.utils.coordinate_box_utils.get_face_boxes(coords, pad=5)[source]

For each face of the convex hull, compute a coordinate box around it.

The convex hull of a macromolecule will have a series of triangular faces. For each such triangular face, we construct a bounding box around this triangle. Think of this box as attempting to capture some binding interaction region whose exterior is controlled by the box. Note that this box will likely be a crude approximation, but the advantage of this technique is that it only uses simple geometry to provide some basic biological insight into the molecule at hand.

The pad parameter is used to control the amount of padding around the face to be used for the coordinate box.

Parameters:
  • coords (np.ndarray) – Of shape (N, 3). The coordinates of a molecule.
  • pad (float, optional (default 5)) – The number of angstroms to pad.

Examples

>>> coords = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
>>> boxes = get_face_boxes(coords, pad=5)

Evaluation Utils

class deepchem.utils.evaluate.Evaluator(model, dataset, transformers, verbose=False)[source]

Class that evaluates a model on a given dataset.

__init__(model, dataset, transformers, verbose=False)[source]

Initialize self. See help(type(self)) for accurate signature.

compute_model_performance(metrics, csv_out=None, stats_out=None, per_task_metrics=False)[source]

Computes statistics of model on test data and saves results to csv.

Parameters:
  • metrics (list) – List of dc.metrics.Metric objects
  • csv_out (str, optional) – Filename to write CSV of model predictions.
  • stats_out (str, optional) – Filename to write computed statistics.
  • per_task_metrics (bool, optional) – If true, return computed metric for each task on multitask dataset.
output_predictions(y_preds, csv_out)[source]

Writes predictions to file.

Parameters:
  • y_preds – np.ndarray
  • csv_out – Filename to write the CSV of predictions to.
output_statistics(scores, stats_out)[source]

Write computed stats to file.

class deepchem.utils.evaluate.GeneratorEvaluator(model, generator, transformers, labels=None, weights=None)[source]

Partner class to Evaluator. Instead of operating over datasets this class operates over Generator. Evaluate a Metric over a model and Generator.

__init__(model, generator, transformers, labels=None, weights=None)[source]
Parameters:
  • model (Model) – Model to evaluate
  • generator (Generator) – Generator which yields batches to feed into the model. For a TensorGraph, each batch should be a dict mapping Layers to NumPy arrays. For a KerasModel, it should be a tuple of the form (inputs, labels, weights).
  • transformers – Transformers to “undo” when applied to the model's outputs
  • labels (list of Layer) – layers which are keys in the generator to compare to outputs
  • weights (list of Layer) – layers which are keys in the generator for weight matrices
compute_model_performance(metrics, per_task_metrics=False)[source]

Computes statistics of model on test data and saves results to csv.

Parameters:
  • metrics (list) – List of dc.metrics.Metric objects
  • per_task_metrics (bool, optional) – If true, return computed metric for each task on multitask dataset.
deepchem.utils.evaluate.relative_difference(x, y)[source]

Compute the relative difference between x and y

deepchem.utils.evaluate.threshold_predictions(y, threshold)[source]

Genomic Utilities

deepchem.utils.genomics.seq_one_hot_encode(sequences, letters='ATCGN')[source]

One hot encodes list of genomic sequences.

Sequences encoded have shape (N_sequences, N_letters, sequence_length, 1). These sequences will be processed as images with one color channel.

Parameters:
  • sequences (np.ndarray) – Array of genetic sequences
  • letters (str) – String with the set of possible letters in the sequences.
Raises:

ValueError: – If sequences are of different lengths.

Returns:

np.ndarray

Return type:

Shape (N_sequences, N_letters, sequence_length, 1)
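The output layout can be made concrete with a small pure-Python sketch that uses nested lists in place of an np.ndarray. one_hot_encode_sketch is a hypothetical name, and the row order following the order of letters is an assumption.

```python
def one_hot_encode_sketch(sequences, letters="ATCGN"):
    """Encode sequences as nested lists shaped (N, N_letters, length, 1)."""
    lengths = {len(seq) for seq in sequences}
    if len(lengths) > 1:
        raise ValueError("all sequences must have the same length")
    encoded = []
    for seq in sequences:
        # One row per letter, one column per position, one trailing channel.
        rows = [[[1 if base == letter else 0] for base in seq]
                for letter in letters]
        encoded.append(rows)
    return encoded


x = one_hot_encode_sketch(["ACGT", "TTNA"])
print(len(x), len(x[0]), len(x[0][0]), len(x[0][0][0]))  # 2 5 4 1
print([cell[0] for cell in x[0][0]])  # row for 'A' in "ACGT": [1, 0, 0, 0]
```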

deepchem.utils.genomics.encode_fasta_sequence(fname)[source]

Loads fasta file and returns an array of one-hot sequences.

Parameters:fname (str) – Filename of fasta file.
Returns:np.ndarray
Return type:Shape (N_sequences, 5, sequence_length, 1)
deepchem.utils.genomics.encode_bio_sequence(fname, file_type='fasta', letters='ATCGN')[source]

Loads a sequence file and returns an array of one-hot sequences.

Parameters:
  • fname (str) – Filename of fasta file.
  • file_type (str) – The type of file encoding to process, e.g. fasta or fastq, this is passed to Biopython.SeqIO.parse.
  • letters (str) – The set of letters that the sequences consist of, e.g. ATCG.
Returns:

np.ndarray

Return type:

Shape (N_sequences, N_letters, sequence_length, 1)

Geometry Utilities

deepchem.utils.geometry_utils.unit_vector(vector)[source]

Returns the unit vector of the vector.

deepchem.utils.geometry_utils.angle_between(vector_i, vector_j)[source]

Returns the angle in radians between vectors “vector_i” and “vector_j”

>>> print("%0.06f" % angle_between((1, 0, 0), (0, 1, 0)))
1.570796
>>> print("%0.06f" % angle_between((1, 0, 0), (1, 0, 0)))
0.000000
>>> print("%0.06f" % angle_between((1, 0, 0), (-1, 0, 0)))
3.141593

Note that this function always returns the smaller of the two angles between the vectors (value between 0 and pi).
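The computation behind these values is the arccosine of the dot product of the two unit vectors. A standard-library sketch follows; the function names are hypothetical, and clamping the dot product is an added safeguard against rounding error, not something this page states about the library.

```python
import math


def unit_vector_sketch(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(c * c for c in vector))
    return tuple(c / norm for c in vector)


def angle_between_sketch(vector_i, vector_j):
    """Angle in radians between two vectors, in the range [0, pi]."""
    u_i = unit_vector_sketch(vector_i)
    u_j = unit_vector_sketch(vector_j)
    dot = sum(a * b for a, b in zip(u_i, u_j))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, dot)))


print("%0.06f" % angle_between_sketch((1, 0, 0), (0, 1, 0)))   # 1.570796
print("%0.06f" % angle_between_sketch((1, 0, 0), (-1, 0, 0)))  # 3.141593
```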

deepchem.utils.geometry_utils.generate_random_unit_vector()[source]

Generate a random unit vector on the sphere S^2.

Citation: http://mathworld.wolfram.com/SpherePointPicking.html

Pseudocode:
  1. Choose random theta element [0, 2*pi]
  2. Choose random z element [-1, 1]
  3. Compute output vector u: (x,y,z) = (sqrt(1-z^2)*cos(theta), sqrt(1-z^2)*sin(theta),z)
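The pseudocode translates almost line for line into Python. The sketch below uses only the standard library; random_unit_vector_sketch is a hypothetical name and it returns a plain tuple rather than a NumPy array.

```python
import math
import random


def random_unit_vector_sketch():
    """Sample a point on the unit sphere following the pseudocode above."""
    theta = random.uniform(0.0, 2.0 * math.pi)  # step 1
    z = random.uniform(-1.0, 1.0)               # step 2
    r = math.sqrt(1.0 - z * z)                  # radius of the circle at height z
    return (r * math.cos(theta), r * math.sin(theta), z)  # step 3


x, y, z = random_unit_vector_sketch()
print(math.sqrt(x * x + y * y + z * z))  # approximately 1.0
```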
deepchem.utils.geometry_utils.generate_random_rotation_matrix()[source]

Generates a random rotation matrix.

  1. Generate a random unit vector u, randomly sampled from the unit sphere (see function generate_random_unit_vector() for details)
  2. Generate a second random unit vector v
     a. If absolute value of u dot v > 0.99, repeat. (This is important for numerical stability. Intuition: we want them to be as linearly independent as possible or else the orthogonalized version of v will be much shorter in magnitude compared to u. I assume in Stack they took this from Gram-Schmidt orthogonalization?)
     b. v" = v - (u dot v)*u, i.e. subtract out the component of v that's in u's direction
     c. normalize v" (this isn't in Stack but I assume it must be done)
  3. find w = u cross v"
  4. u, v", and w will form the columns of a rotation matrix, R. The intuition is that u, v" and w are, respectively, what the standard basis vectors e1, e2, and e3 will be mapped to under the transformation.
Returns:R – R is of shape (3, 3)
Return type:np.ndarray
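The procedure can be sketched in pure Python to make each step concrete. This is an illustration under stated assumptions, not the library source: the helper names are hypothetical, and the matrix is returned as a list of rows instead of an np.ndarray.

```python
import math
import random


def _unit_vector_on_sphere():
    """Random unit vector: random angle and height, as in sphere point picking."""
    theta = random.uniform(0.0, 2.0 * math.pi)
    z = random.uniform(-1.0, 1.0)
    r = math.sqrt(1.0 - z * z)
    return (r * math.cos(theta), r * math.sin(theta), z)


def random_rotation_matrix_sketch():
    """Build a 3x3 rotation matrix whose columns are u, v_perp and w."""
    u = _unit_vector_on_sphere()
    # Resample v until it is not nearly parallel to u.
    while True:
        v = _unit_vector_on_sphere()
        dot = sum(a * b for a, b in zip(u, v))
        if abs(dot) <= 0.99:
            break
    # Gram-Schmidt: remove the component of v along u, then normalize.
    v_perp = [b - dot * a for a, b in zip(u, v)]
    norm = math.sqrt(sum(c * c for c in v_perp))
    v_perp = [c / norm for c in v_perp]
    # w = u x v_perp completes a right-handed orthonormal basis.
    w = [u[1] * v_perp[2] - u[2] * v_perp[1],
         u[2] * v_perp[0] - u[0] * v_perp[2],
         u[0] * v_perp[1] - u[1] * v_perp[0]]
    # Row i holds the i-th component of each column vector.
    return [[u[i], v_perp[i], w[i]] for i in range(3)]


R = random_rotation_matrix_sketch()
for row in R:
    print(["%0.3f" % value for value in row])
```

Because w is the cross product of the first two columns, the determinant is +1, so the result is a proper rotation and not a reflection.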
deepchem.utils.geometry_utils.is_angle_within_cutoff(vector_i, vector_j, angle_cutoff)[source]

A utility function to compute whether two vectors are within a cutoff from 180 degrees apart.

Parameters:
  • vector_i (np.ndarray) – Of shape (3,)
  • vector_j (np.ndarray) – Of shape (3,)
  • angle_cutoff (float) – The deviation from 180 (in degrees)
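A standard-library sketch of this check is shown below. within_cutoff_of_180 is a hypothetical name, and the use of strict inequalities at the boundary is an assumption.

```python
import math


def within_cutoff_of_180(vector_i, vector_j, angle_cutoff):
    """True if the angle between the vectors is within angle_cutoff of 180 degrees."""
    dot = sum(a * b for a, b in zip(vector_i, vector_j))
    norm_i = math.sqrt(sum(a * a for a in vector_i))
    norm_j = math.sqrt(sum(b * b for b in vector_j))
    # Clamp to guard acos against rounding error.
    cosine = max(-1.0, min(1.0, dot / (norm_i * norm_j)))
    angle = math.degrees(math.acos(cosine))
    return 180.0 - angle_cutoff < angle < 180.0 + angle_cutoff


print(within_cutoff_of_180((1, 0, 0), (-1, 0, 0), 10))  # True
print(within_cutoff_of_180((1, 0, 0), (0, 1, 0), 10))   # False
```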

Hash Function Utilities

deepchem.utils.hash_utils.hash_ecfp(ecfp, size)[source]

Returns an int < size representing given ECFP fragment.

Input must be a string. This utility function is used for various ECFP based fingerprints.

Parameters:
  • ecfp (str) – String to hash. Usually an ECFP fragment.
  • size (int) – Hash to an int in range [0, size)
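A string can be mapped to a stable integer in [0, size) by reducing a cryptographic digest modulo size. The sketch below assumes an MD5 digest; hash_fragment is a hypothetical helper, the fragment string is arbitrary, and the digest DeepChem actually uses is not stated on this page.

```python
import hashlib


def hash_fragment(fragment, size=1024):
    """Map a string to an int in [0, size) that is stable across runs."""
    digest = hashlib.md5(fragment.encode("utf-8")).hexdigest()
    return int(digest, 16) % size


index = hash_fragment("C(=O)N", size=1024)
print(0 <= index < 1024)  # True
```

Python's built-in hash() is not a good substitute here, because string hashes are randomized per process unless PYTHONHASHSEED is fixed, so fingerprints would not be reproducible.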
deepchem.utils.hash_utils.hash_ecfp_pair(ecfp_pair, size)[source]

Returns an int < size representing that ECFP pair.

Input must be a tuple of strings. This utility is primarily used for spatial contact featurizers. For example, if a protein and ligand have close contact region, the first string could be the protein’s fragment and the second the ligand’s fragment. The pair could be hashed together to achieve one hash value for this contact region.

Parameters:
  • ecfp_pair (tuple) – Pair of ECFP fragment strings
  • size (int) – Hash to an int in range [0, size)
deepchem.utils.hash_utils.vectorize(hash_function, feature_dict=None, size=1024)[source]

Helper function to vectorize a spatial description from a hash.

Hash functions are used to perform spatial featurizations in DeepChem. However, it’s necessary to convert backwards from the hash function to feature vectors. This function aids in this conversion procedure. It creates a vector of zeros of length size. It then loops through feature_dict, uses hash_function to hash the stored value to an integer in range [0, size) and bumps that index.

Parameters:
  • hash_function (function) – Should accept two arguments, feature, and size and return a hashed integer. Here feature is the item to hash, and size is an int. For example, if size=1024, then hashed values must fall in range [0, 1024).
  • feature_dict (dict) – Maps unique keys to features computed.
  • size (int, optional (default 1024)) – Length of generated bit vector
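The description above amounts to a counting loop. The sketch below mirrors it with a plain list in place of a NumPy vector; vectorize_sketch and toy_hash are hypothetical names, the CRC32-based hash is a stand-in chosen only to keep the example self-contained, and the feature strings are invented.

```python
import zlib


def toy_hash(feature, size):
    """Stand-in hash function: maps a string to an int in [0, size)."""
    return zlib.crc32(feature.encode("utf-8")) % size


def vectorize_sketch(hash_function, feature_dict=None, size=1024):
    """Count how many stored features hash to each index."""
    vector = [0] * size
    if feature_dict is not None:
        for feature in feature_dict.values():
            # Hash the stored value and bump that index.
            vector[hash_function(feature, size)] += 1
    return vector


features = {0: "frag_a", 1: "frag_b", 2: "frag_a"}
vec = vectorize_sketch(toy_hash, features, size=16)
print(len(vec), sum(vec))  # 16 3
```

Identical features always land on the same index, so the entry for "frag_a" is at least 2 in this example.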

Voxel Utils

deepchem.utils.voxel_utils.convert_atom_to_voxel(coordinates, atom_index, box_width, voxel_width)[source]

Converts atom coordinates to an i,j,k grid index.

This function offsets molecular atom coordinates by (box_width/2, box_width/2, box_width/2) and then divides by voxel_width to compute the voxel indices.

Parameters:
  • coordinates (np.ndarray) – Array with coordinates of all atoms in the molecule, shape (N, 3).
  • atom_index (int) – Index of an atom in the molecule.
  • box_width (float) – Size of the box in Angstroms.
  • voxel_width (float) – Size of a voxel in Angstroms
Returns:

  • A list containing a numpy array of length 3 with [i, j, k], the voxel coordinates of the specified atom. This is returned as a list so it has the same API as convert_atom_pair_to_voxel.
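The offset-and-divide rule can be checked on a small example. In this sketch atom_to_voxel_sketch is a hypothetical name, the coordinates are invented, rounding down with floor is an assumption, and a tuple is returned inside the list where the library returns a numpy array.

```python
import math


def atom_to_voxel_sketch(coordinates, atom_index, box_width, voxel_width):
    """Return a one-element list holding the (i, j, k) voxel index of an atom."""
    half = box_width / 2.0
    # Shift so the box spans [0, box_width), then count whole voxels.
    return [tuple(math.floor((c + half) / voxel_width)
                  for c in coordinates[atom_index])]


coords = [(0.0, 0.0, 0.0), (-7.5, 3.2, 7.9)]
print(atom_to_voxel_sketch(coords, 0, box_width=16.0, voxel_width=1.0))  # [(8, 8, 8)]
print(atom_to_voxel_sketch(coords, 1, box_width=16.0, voxel_width=1.0))  # [(0, 11, 15)]
```

An atom at the origin lands in the middle of the grid because the coordinates are shifted by half the box width before dividing.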

deepchem.utils.voxel_utils.convert_atom_pair_to_voxel(coordinates_tuple, atom_index_pair, box_width, voxel_width)[source]

Converts a pair of atoms to a list of i,j,k tuples.

Parameters:
  • coordinates_tuple (tuple) – A tuple containing two molecular coordinate arrays of shapes (N, 3) and (M, 3).
  • atom_index_pair (tuple) – A tuple of indices for the atoms in the two molecules.
  • box_width (float) – Size of the box in Angstroms.
  • voxel_width (float) – Size of a voxel in Angstroms
Returns:

  • A list containing two numpy arrays of length 3 with [i, j, k], the voxel coordinates of the specified atoms.

deepchem.utils.voxel_utils.voxelize(get_voxels, box_width, voxel_width, hash_function, coordinates, feature_dict=None, feature_list=None, nb_channel=16, dtype='np.int8')[source]

Helper function to voxelize inputs.

This helper function helps convert a hash function which specifies spatial features of a molecular complex into a voxel tensor. This utility is used by various featurizers that generate voxel grids.

Parameters:
  • get_voxels (function) – Function that voxelizes inputs
  • box_width (float) – Size of a box in which voxel features are calculated. Box is centered on a ligand centroid.
  • voxel_width (float) – Size of a 3D voxel in a grid in Angstroms.
  • hash_function (function) – Used to map feature choices to voxel channels.
  • coordinates (np.ndarray) – Contains the 3D coordinates of a molecular system.
  • feature_dict (dict) – Keys are atom indices or tuples of atom indices, the values are computed features. If hash_function is not None, then the values are hashed using the hash function into [0, nb_channel) and this channel at the voxel for the given key is incremented by 1 for each dictionary entry. If hash_function is None, then the value must be a vector of size (nb_channel,) which is added to the existing channel values at that voxel grid.
  • feature_list (list) – List of atom indices or tuples of atom indices. This can only be used if nb_channel==1. Increments the voxels corresponding to these indices by 1 for each entry.
  • nb_channel (int (Default 16)) – The number of feature channels computed per voxel. Should be a power of 2.
  • dtype (type) – The dtype of the numpy ndarray created to hold features.
Returns:

  • Tensor of shape (voxels_per_edge, voxels_per_edge, voxels_per_edge, nb_channel).