Utilities
DeepChem has a broad collection of utility functions. Many of these may be of independent interest to users since they deal with some tricky aspects of processing scientific datatypes.
Data Utilities
Array Utilities
- pad_array(x: numpy.ndarray, shape: Union[Tuple, int], fill: float = 0.0, both: bool = False) numpy.ndarray [source]¶
Pad an array with a fill value.
- Parameters
x (np.ndarray) – A numpy array.
shape (Tuple or int) – Desired shape. If int, all dimensions are padded to that size.
fill (float, optional (default 0.0)) – The padded value.
both (bool, optional (default False)) – If True, split the padding on both sides of each axis. If False, padding is applied to the end of each axis.
- Returns
A padded numpy array
- Return type
np.ndarray
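The padding behaviour described above can be sketched with plain NumPy. This is a minimal illustration, not DeepChem's implementation: the name pad_array_sketch is hypothetical, and how an odd amount of padding is split when both=True is an assumption (the extra element is placed at the end).

```python
import numpy as np


def pad_array_sketch(x, shape, fill=0.0, both=False):
    """Pad `x` up to `shape` with `fill` (illustrative, not DeepChem's code)."""
    x = np.asarray(x)
    # An int target applies the same size to every axis.
    if isinstance(shape, int):
        shape = (shape,) * x.ndim
    if len(shape) != x.ndim:
        raise ValueError("shape must have one entry per axis of x")
    pad_width = []
    for current, target in zip(x.shape, shape):
        diff = target - current
        if diff < 0:
            raise ValueError("target shape must not be smaller than x.shape")
        if both:
            # Assumption: when diff is odd, the extra element goes at the end.
            before = diff // 2
            pad_width.append((before, diff - before))
        else:
            pad_width.append((0, diff))
    return np.pad(x, pad_width, mode="constant", constant_values=fill)


x = np.ones((2, 2))
end_padded = pad_array_sketch(x, (3, 4))                  # fill added at the end
centered = pad_array_sketch(x, 4, fill=-1.0, both=True)   # one layer per side
```

With both=False the original values stay in the leading corner of the result; with both=True they sit in the middle.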
Data Directory
The DeepChem data directory is where downloaded MoleculeNet datasets are stored.
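The hhblits documentation on this page names a DEEPCHEM_DATA_DIR environment variable for setting this directory. A minimal sketch of how such a lookup might work is shown here; the helper name resolve_data_dir and the fallback to the system temporary directory are assumptions, not a statement of DeepChem's behaviour.

```python
import os
import tempfile


def resolve_data_dir(env_var="DEEPCHEM_DATA_DIR"):
    """Return the directory named by `env_var`, or the system temp directory."""
    # Assumption: fall back to the temp directory when the variable is unset.
    return os.environ.get(env_var, tempfile.gettempdir())


data_dir = resolve_data_dir()
```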
URL Handling
- download_url(url: str, dest_dir: str = '/tmp', name: Optional[str] = None)[source]¶
Download a file to disk.
- Parameters
url (str) – The URL to download from
dest_dir (str) – The directory to save the file in
name (str) – The file name to save it as. If omitted, it will try to extract a file name from the URL
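One plausible way to extract a file name from a URL when name is omitted is to take the last path component and ignore any query string. The sketch below uses only the standard library; filename_from_url and the example URL are hypothetical, and the exact rule DeepChem applies may differ.

```python
import os
from urllib.parse import urlparse


def filename_from_url(url):
    """Guess a file name from the last path component of `url`."""
    # urlparse separates the path from any query string or fragment.
    path = urlparse(url).path
    return os.path.basename(path)


name = filename_from_url("https://example.com/data/tox21.csv.gz?download=1")
# name == "tox21.csv.gz"
```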
File Handling
- untargz_file(file: str, dest_dir: str = '/tmp', name: Optional[str] = None)[source]¶
Untar and unzip a .tar.gz file to disk.
- Parameters
file (str) – The filepath to decompress
dest_dir (str) – The directory to save the file in
name (str) – The file name to save it as. If omitted, it will use the file name
- unzip_file(file: str, dest_dir: str = '/tmp', name: Optional[str] = None)[source]¶
Unzip a .zip file to disk.
- Parameters
file (str) – The filepath to decompress
dest_dir (str) – The directory to save the file in
name (str) – The directory name to unzip it to. If omitted, it will use the file name
- load_data(input_files: List[str], shard_size: Optional[int] = None) Iterator[Any] [source]¶
Loads data from files.
- Parameters
input_files (List[str]) – List of filenames.
shard_size (int, default None) – Size of shard to yield
- Returns
Iterator which iterates over provided files.
- Return type
Iterator[Any]
Notes
The supported file types are SDF, CSV and Pickle.
- load_sdf_files(input_files: List[str], clean_mols: bool = True, tasks: List[str] = [], shard_size: Optional[int] = None) Iterator[pandas.core.frame.DataFrame] [source]¶
Load SDF file into dataframe.
- Parameters
input_files (List[str]) – List of filenames
clean_mols (bool, default True) – Whether to sanitize molecules.
tasks (List[str], default []) – Each entry in tasks is treated as a property in the SDF file and is retrieved with mol.GetProp(str(task)) where mol is the RDKit mol loaded from a given SDF entry.
shard_size (int, default None) – The shard size to yield at one time.
- Returns
Generator that yields dataframes, one per shard of size shard_size.
- Return type
Iterator[pd.DataFrame]
Notes
This function requires RDKit to be installed.
- load_csv_files(input_files: List[str], shard_size: Optional[int] = None) Iterator[pandas.core.frame.DataFrame] [source]¶
Load data as pandas dataframe from CSV files.
- Parameters
input_files (List[str]) – List of filenames
shard_size (int, default None) – The shard size to yield at one time.
- Returns
Generator that yields dataframes, one per shard of size shard_size.
- Return type
Iterator[pd.DataFrame]
- load_json_files(input_files: List[str], shard_size: Optional[int] = None) Iterator[pandas.core.frame.DataFrame] [source]¶
Load data as pandas dataframe.
- Parameters
input_files (List[str]) – List of json filenames.
shard_size (int, default None) – Chunksize for reading json files.
- Returns
Generator that yields dataframes, one per shard of size shard_size.
- Return type
Iterator[pd.DataFrame]
Notes
To load shards from a json file into a Pandas dataframe, the file must be originally saved with
df.to_json('filename.json', orient='records', lines=True)
- load_pickle_files(input_files: List[str]) Iterator[Any] [source]¶
Load dataset from pickle files.
- Parameters
input_files (List[str]) – The list of pickle filenames. This function can also load gzipped pickle files such as XXXX.pkl.gz.
- Returns
Generator which yields the objects loaded from each pickle file.
- Return type
Iterator[Any]
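A generator with this behaviour can be sketched using the standard library alone. The function name iter_pickle_files is hypothetical, and detecting compression from a ".gz" suffix is an assumption about how gzipped files are recognised.

```python
import gzip
import pickle


def iter_pickle_files(input_files):
    """Yield the object stored in each pickle file, one file at a time."""
    for path in input_files:
        # Assumption: a ".gz" suffix marks a gzip-compressed pickle.
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rb") as handle:
            yield pickle.load(handle)
```

Because the function is a generator, each file is opened only when the caller asks for the next object, so a long list of files does not need to fit in memory at once.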
- load_from_disk(filename: str) Any [source]¶
Load a dataset from file.
- Parameters
filename (str) – Name of the file to load data from.
- Returns
A loaded object from file.
- Return type
Any
- save_to_disk(dataset: Any, filename: str, compress: int = 3)[source]¶
Save a dataset to file.
- Parameters
dataset (Any) – The data to save.
filename (str) – Path to save data.
compress (int, default 3) – The compress option when dumping joblib file.
- load_dataset_from_disk(save_dir: str) Tuple[bool, Optional[Tuple[deepchem.data.datasets.DiskDataset, deepchem.data.datasets.DiskDataset, deepchem.data.datasets.DiskDataset]], List[transformers.Transformer]] [source]¶
Loads MoleculeNet train/valid/test/transformers from disk.
Expects that data was saved using save_dataset_to_disk below. Expects the following directory structure for save_dir:
save_dir/
  ---> train_dir/
  ---> valid_dir/
  ---> test_dir/
  ---> transformers.pkl
- Parameters
save_dir (str) – Directory name to load datasets.
- Returns
loaded (bool) – Whether the load succeeded
all_dataset (Tuple[DiskDataset, DiskDataset, DiskDataset]) – The train, valid, test datasets
transformers (List[Transformer]) – The transformers used for this dataset
See also: save_dataset_to_disk
- save_dataset_to_disk(save_dir: str, train: deepchem.data.datasets.DiskDataset, valid: deepchem.data.datasets.DiskDataset, test: deepchem.data.datasets.DiskDataset, transformers: List[transformers.Transformer])[source]¶
Utility used by MoleculeNet to save train/valid/test datasets.
This utility function saves a train/valid/test split of a dataset along with transformers in the same directory. The saved datasets will take the following structure:
save_dir/
  ---> train_dir/
  ---> valid_dir/
  ---> test_dir/
  ---> transformers.pkl
- Parameters
save_dir (str) – Directory name to save datasets to.
train (DiskDataset) – Training dataset to save.
valid (DiskDataset) – Validation dataset to save.
test (DiskDataset) – Test dataset to save.
transformers (List[Transformer]) – List of transformers to save to disk.
See also: load_dataset_from_disk
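The directory layout shared by save_dataset_to_disk and load_dataset_from_disk can be checked before attempting a load. The following is a small sketch using pathlib; has_saved_dataset is a hypothetical helper, not part of DeepChem, and it only tests that the expected entries exist.

```python
from pathlib import Path

# Names taken from the layout described above.
EXPECTED_DIRS = ("train_dir", "valid_dir", "test_dir")
TRANSFORMERS_FILE = "transformers.pkl"


def has_saved_dataset(save_dir):
    """Return True if `save_dir` contains the full train/valid/test layout."""
    root = Path(save_dir)
    dirs_ok = all((root / name).is_dir() for name in EXPECTED_DIRS)
    return dirs_ok and (root / TRANSFORMERS_FILE).is_file()
```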
Molecular Utilities
- class ConformerGenerator(max_conformers: int = 1, rmsd_threshold: float = 0.5, force_field: str = 'uff', pool_multiplier: int = 10)[source]¶
Generate molecule conformers.
Notes
Procedure:
1. Generate a pool of conformers.
2. Minimize conformers.
3. Prune conformers using an RMSD threshold.
Note that pruning is done after minimization, which differs from the protocol described in the references [1] [2].
References
Notes
This class requires RDKit to be installed.
- __init__(max_conformers: int = 1, rmsd_threshold: float = 0.5, force_field: str = 'uff', pool_multiplier: int = 10)[source]¶
- Parameters
max_conformers (int, optional (default 1)) – Maximum number of conformers to generate (after pruning).
rmsd_threshold (float, optional (default 0.5)) – RMSD threshold for pruning conformers. If None or negative, no pruning is performed.
force_field (str, optional (default 'uff')) – Force field to use for conformer energy calculation and minimization. Options are ‘uff’, ‘mmff94’, and ‘mmff94s’.
pool_multiplier (int, optional (default 10)) – Factor to multiply by max_conformers to generate the initial conformer pool. Since conformers are pruned after energy minimization, increasing the size of the pool increases the chance of identifying max_conformers unique conformers.
- generate_conformers(mol: Any) Any [source]¶
Generate conformers for a molecule.
This function returns a copy of the original molecule with embedded conformers.
- Parameters
mol (rdkit.Chem.rdchem.Mol) – RDKit Mol object
- Returns
mol – A new RDKit Mol object containing the chosen conformers, sorted by increasing energy.
- Return type
rdkit.Chem.rdchem.Mol
- embed_molecule(mol: Any) Any [source]¶
Generate conformers, possibly with pruning.
- Parameters
mol (rdkit.Chem.rdchem.Mol) – RDKit Mol object
- Returns
mol – RDKit Mol object with embedded multiple conformers.
- Return type
rdkit.Chem.rdchem.Mol
- get_molecule_force_field(mol: Any, conf_id: Optional[int] = None, **kwargs) Any [source]¶
Get a force field for a molecule.
- Parameters
mol (rdkit.Chem.rdchem.Mol) – RDKit Mol object with embedded conformers.
conf_id (int, optional) – ID of the conformer to associate with the force field.
kwargs (dict, optional) – Keyword arguments for force field constructor.
- Returns
ff – RDKit force field instance for a molecule.
- Return type
rdkit.ForceField.rdForceField.ForceField
- minimize_conformers(mol: Any) None [source]¶
Minimize molecule conformers.
- Parameters
mol (rdkit.Chem.rdchem.Mol) – RDKit Mol object with embedded conformers.
- get_conformer_energies(mol: Any) numpy.ndarray [source]¶
Calculate conformer energies.
- Parameters
mol (rdkit.Chem.rdchem.Mol) – RDKit Mol object with embedded conformers.
- Returns
energies – Minimized conformer energies.
- Return type
np.ndarray
- prune_conformers(mol: Any) Any [source]¶
Prune conformers from a molecule using an RMSD threshold, starting with the lowest energy conformer.
- Parameters
mol (rdkit.Chem.rdchem.Mol) – RDKit Mol object
- Returns
new_mol – A new rdkit.Chem.rdchem.Mol containing the chosen conformers, sorted by increasing energy.
- Return type
rdkit.Chem.rdchem.Mol
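The pruning step can be illustrated without RDKit by working on precomputed energies and a pairwise RMSD matrix. This is a sketch under stated assumptions: prune_by_rmsd is a hypothetical name, a conformer is kept only if its RMSD to every already-kept conformer is at least the threshold, and with no pruning every conformer is returned in energy order.

```python
import numpy as np


def prune_by_rmsd(energies, rmsd, threshold=0.5, max_conformers=1):
    """Return indices of conformers kept after greedy RMSD pruning."""
    order = np.argsort(energies)  # lowest-energy conformer first
    if threshold is None or threshold < 0:
        # No pruning: every conformer is kept (energy-sorted for convenience).
        return [int(i) for i in order]
    keep = []
    for i in order:
        if len(keep) >= max_conformers:
            break
        # Assumption: keep a conformer only if it is at least `threshold`
        # away from every conformer already kept.
        if all(rmsd[i, j] >= threshold for j in keep):
            keep.append(int(i))
    return keep


energies = np.array([3.0, 1.0, 2.0])
rmsd = np.array([[0.0, 0.9, 0.8],
                 [0.9, 0.0, 0.2],
                 [0.8, 0.2, 0.0]])
kept = prune_by_rmsd(energies, rmsd, threshold=0.5, max_conformers=3)
# kept == [1, 0]: conformer 2 lies within 0.2 of conformer 1 and is dropped
```

Starting from the lowest-energy conformer means that, among near-duplicates, the one with the lowest energy is the one retained.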
- get_xyz_from_mol(mol)[source]¶
Extracts a numpy array of coordinates from a molecule.
Returns a (N, 3) numpy array of 3d coords of given rdkit molecule
- Parameters
mol (rdkit Molecule) – Molecule to extract coordinates for
- Return type
Numpy ndarray of shape (N, 3) where N = mol.GetNumAtoms().
- add_hydrogens_to_mol(mol, is_protein=False)[source]¶
Add hydrogens to a molecule object
- Parameters
mol (Rdkit Mol) – Molecule to hydrogenate
is_protein (bool, optional (default False)) – Whether this molecule is a protein.
- Return type
Rdkit Mol
Note
This function requires RDKit and PDBFixer to be installed.
- compute_charges(mol)[source]¶
Attempt to compute Gasteiger Charges on Mol
This also has the side effect of calculating charges on mol. The mol passed into this function has to already have been sanitized
- Parameters
mol (rdkit molecule) –
- Return type
No return since updates in place.
Note
This function requires RDKit to be installed.
- load_molecule(molecule_file, add_hydrogens=True, calc_charges=True, sanitize=True, is_protein=False)[source]¶
Converts molecule file to (xyz-coords, rdkit mol object)
Given molecule_file, returns a tuple of xyz coords of molecule and an rdkit object representing that molecule in that order (xyz, rdkit_mol). This ordering convention is used in the code in a few places.
- Parameters
molecule_file (str) – filename for molecule
add_hydrogens (bool, optional (default True)) – If True, add hydrogens via pdbfixer
calc_charges (bool, optional (default True)) – If True, add charges via rdkit
sanitize (bool, optional (default True)) – If True, sanitize molecules via rdkit
is_protein (bool, optional (default False)) – If True, this molecule is loaded as a protein. This flag will affect some of the cleanup procedures applied.
- Returns
Tuple (xyz, mol) if file contains single molecule. Else returns a
list of the tuples for the separate molecules in this list.
Note
This function requires RDKit to be installed.
- write_molecule(mol, outfile, is_protein=False)[source]¶
Write molecule to a file
This function writes a representation of the provided molecule to the specified outfile. Doesn’t return anything.
- Parameters
mol (rdkit Mol) – Molecule to write
outfile (str) – Filename to write mol to
is_protein (bool, optional) – Is this molecule a protein?
Note
This function requires RDKit to be installed.
- Raises
ValueError – if outfile isn’t of a supported format.
Molecular Fragment Utilities
It’s often convenient to manipulate subsets of a molecule. The MolecularFragment
class aids in such manipulations.
- class MolecularFragment(atoms: Sequence[Any], coords: numpy.ndarray)[source]¶
A class that represents a fragment of a molecule.
It’s often convenient to represent a fragment of a molecule. For example, if two molecules form a molecular complex, it may be useful to create two fragments which represent the subset of each molecule that is close to the other molecule (in the contact region).
Ideally, we’d be able to do this directly in RDKit, but manipulating molecular fragments doesn’t seem to be supported functionality.
Examples
>>> import numpy as np
>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles("C")
>>> coords = np.array([[0.0, 0.0, 0.0]])
>>> atom = mol.GetAtoms()[0]
>>> fragment = MolecularFragment([atom], coords)
- __init__(atoms: Sequence[Any], coords: numpy.ndarray)[source]¶
Initialize this object.
- Parameters
atoms (Iterable[rdkit.Chem.rdchem.Atom]) – Each entry in this list should be an RDKit Atom.
coords (np.ndarray) – Array of locations for atoms of shape (N, 3) where N == len(atoms).
- GetAtoms() List[deepchem.utils.fragment_utils.AtomShim] [source]¶
Returns the list of atoms
- Returns
list of atoms in this fragment.
- Return type
List[AtomShim]
- class AtomShim(atomic_num: int, partial_charge: float, atom_coords: numpy.ndarray)[source]¶
This is a shim object wrapping an atom.
We use this class instead of raw RDKit atoms since manipulating a large number of rdkit Atoms seems to result in segfaults. Wrapping the basic information in an AtomShim seems to avoid issues.
- __init__(atomic_num: int, partial_charge: float, atom_coords: numpy.ndarray)[source]¶
Initialize this object
- Parameters
atomic_num (int) – Atomic number for this atom.
partial_charge (float) – The partial Gasteiger charge for this atom
atom_coords (np.ndarray) – Of shape (3,) with the coordinates of this atom
- GetAtomicNum() int [source]¶
Returns atomic number for this atom.
- Returns
Atomic number for this atom.
- Return type
int
- strip_hydrogens(coords: numpy.ndarray, mol: Union[Any, deepchem.utils.fragment_utils.MolecularFragment]) Tuple[numpy.ndarray, deepchem.utils.fragment_utils.MolecularFragment] [source]¶
Strip the hydrogens from input molecule
- Parameters
coords (np.ndarray) – The coords must be of shape (N, 3) and correspond to coordinates of mol.
mol (rdkit.Chem.rdchem.Mol or MolecularFragment) – The molecule to strip
- Returns
A tuple of (coords, mol_frag) where coords is a numpy array of coordinates with hydrogen coordinates removed. mol_frag is a MolecularFragment.
- Return type
Tuple[np.ndarray, MolecularFragment]
Notes
This function requires RDKit to be installed.
- merge_molecular_fragments(molecules: List[deepchem.utils.fragment_utils.MolecularFragment]) Optional[deepchem.utils.fragment_utils.MolecularFragment] [source]¶
Helper method to merge a list of molecular fragments.
- Parameters
molecules (List[MolecularFragment]) – List of MolecularFragment objects.
- Returns
Returns a merged MolecularFragment
- Return type
Optional[MolecularFragment]
- get_contact_atom_indices(fragments: List[Tuple[numpy.ndarray, Any]], cutoff: float = 4.5) List[List[int]] [source]¶
Compute the atoms close to the contact region.
Molecular complexes can get very large. This can make it unwieldy to compute functions on them. To improve memory usage, it can be very useful to trim out atoms that aren’t close to contact regions. This function computes pairwise distances between all pairs of molecules in the molecular complex. If an atom is within cutoff distance of any atom on another molecule in the complex, it is regarded as a contact atom. Otherwise it is trimmed.
- Parameters
fragments (List[Tuple[np.ndarray, rdkit.Chem.rdchem.Mol]]) – As returned by rdkit_utils.load_complex, a list of tuples of (coords, mol) where coords is a (N_atoms, 3) array and mol is the rdkit molecule object.
cutoff (float, optional (default 4.5)) – The cutoff distance in angstroms.
- Returns
A list of length len(fragments). Each entry in this list is a list of atom indices from that molecule which should be kept, in sorted order.
- Return type
List[List[int]]
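The contact computation described above amounts to a pairwise distance test between every pair of molecules. The sketch below works on coordinate arrays only, whereas the documented function takes (coords, mol) tuples; the name contact_atom_indices and the strict "less than cutoff" comparison are assumptions.

```python
import numpy as np


def contact_atom_indices(coords_list, cutoff=4.5):
    """For each coordinate array, list atoms within `cutoff` of another array."""
    keep = [set() for _ in coords_list]
    for a in range(len(coords_list)):
        for b in range(a + 1, len(coords_list)):
            # Pairwise distances, shape (N_a, N_b), via broadcasting.
            diff = coords_list[a][:, None, :] - coords_list[b][None, :, :]
            dist = np.linalg.norm(diff, axis=-1)
            close = dist < cutoff  # assumption: strict inequality
            keep[a].update(np.nonzero(close.any(axis=1))[0].tolist())
            keep[b].update(np.nonzero(close.any(axis=0))[0].tolist())
    return [sorted(indices) for indices in keep]


protein = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
ligand = np.array([[1.0, 0.0, 0.0], [30.0, 0.0, 0.0]])
contacts = contact_atom_indices([protein, ligand], cutoff=4.5)
# contacts == [[0], [0]]
```

Only the first atom of each array is within 4.5 of an atom in the other array, so only index 0 is kept for each.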
- reduce_molecular_complex_to_contacts(fragments: List[Tuple[numpy.ndarray, Any]], cutoff: float = 4.5) List[Tuple[numpy.ndarray, deepchem.utils.fragment_utils.MolecularFragment]] [source]¶
Reduce a molecular complex to only those atoms near a contact.
Molecular complexes can get very large. This can make it unwieldy to compute functions on them. To improve memory usage, it can be very useful to trim out atoms that aren’t close to contact regions. This function takes in a molecular complex and returns a new molecular complex representation that contains only contact atoms. The contact atoms are computed by calling get_contact_atom_indices under the hood.
- Parameters
fragments (List[Tuple[np.ndarray, rdkit.Chem.rdchem.Mol]]) – As returned by rdkit_utils.load_complex, a list of tuples of (coords, mol) where coords is a (N_atoms, 3) array and mol is the rdkit molecule object.
cutoff (float) – The cutoff distance in angstroms.
- Returns
A list of length len(fragments). Each entry in this list is a tuple of (coords, MolecularFragment). The coords is stripped down to (N_contact_atoms, 3) where N_contact_atoms is the number of contact atoms for this complex. MolecularFragment is used since it’s tricky to make a RDKit sub-molecule.
- Return type
List[Tuple[np.ndarray, MolecularFragment]]
Coordinate Box Utilities
- class CoordinateBox(x_range: Tuple[float, float], y_range: Tuple[float, float], z_range: Tuple[float, float])[source]¶
A coordinate box that represents a block in space.
Molecular complexes are typically represented with atoms as coordinate points. Each complex is naturally associated with a number of different box regions. For example, the bounding box is a box that contains all atoms in the molecular complex. A binding pocket box is a box that focuses in on a binding region of a protein to a ligand. An interface box is the region in which two proteins have a bulk interaction.
The CoordinateBox class is designed to represent such regions of space. It consists of the coordinates of the box, and the collection of atoms that live in this box alongside their coordinates.
- __init__(x_range: Tuple[float, float], y_range: Tuple[float, float], z_range: Tuple[float, float])[source]¶
Initialize this box.
- Parameters
x_range (Tuple[float, float]) – A tuple of (x_min, x_max) with min and max x-coordinates.
y_range (Tuple[float, float]) – A tuple of (y_min, y_max) with min and max y-coordinates.
z_range (Tuple[float, float]) – A tuple of (z_min, z_max) with min and max z-coordinates.
- Raises
ValueError –
- __contains__(point: Sequence[float]) bool [source]¶
Check whether a point is in this box.
- Parameters
point (Sequence[float]) – 3-tuple or list of length 3 or np.ndarray of shape (3,). The (x, y, z) coordinates of a point in space.
- Returns
True if point is contained in this box.
- Return type
bool
- center() Tuple[float, float, float] [source]¶
Computes the center of this box.
- Returns
(x, y, z) the coordinates of the center of the box.
- Return type
Tuple[float, float, float]
Examples
>>> box = CoordinateBox((0, 1), (0, 1), (0, 1))
>>> box.center()
(0.5, 0.5, 0.5)
- volume() float [source]¶
Computes and returns the volume of this box.
- Returns
The volume of this box. Can be 0 if box is empty
- Return type
float
Examples
>>> box = CoordinateBox((0, 1), (0, 1), (0, 1))
>>> box.volume()
1
- contains(other: deepchem.utils.coordinate_box_utils.CoordinateBox) bool [source]¶
Test whether this box contains another.
This method checks whether other is contained in this box.
- Parameters
other (CoordinateBox) – The box to check for containment in this box.
- Returns
True if other is contained in this box.
- Return type
bool
- Raises
ValueError –
- intersect_interval(interval1: Tuple[float, float], interval2: Tuple[float, float]) Tuple[float, float] [source]¶
Computes the intersection of two intervals.
- Parameters
interval1 (Tuple[float, float]) – Should be (x1_min, x1_max)
interval2 (Tuple[float, float]) – Should be (x2_min, x2_max)
- Returns
x_intersect – Should be the intersection. If the intersection is empty returns (0, 0) to represent the empty set. Otherwise is (max(x1_min, x2_min), min(x1_max, x2_max)).
- Return type
Tuple[float, float]
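The return rule stated above translates into a few lines of Python. This sketch uses the hypothetical name intersect_interval_sketch and assumes that intervals which merely touch at an endpoint yield a zero-width interval rather than the (0, 0) empty marker.

```python
def intersect_interval_sketch(interval1, interval2):
    """Intersect two closed intervals given as (min, max) tuples."""
    x1_min, x1_max = interval1
    x2_min, x2_max = interval2
    # Disjoint intervals: return (0, 0) as the empty-set marker.
    if x1_max < x2_min or x2_max < x1_min:
        return (0, 0)
    return (max(x1_min, x2_min), min(x1_max, x2_max))


overlap = intersect_interval_sketch((0.0, 2.0), (1.0, 3.0))   # (1.0, 2.0)
disjoint = intersect_interval_sketch((0.0, 1.0), (2.0, 3.0))  # (0, 0)
```

Note that (0, 0) is ambiguous by design: it is also what two intervals touching exactly at 0 would produce, so callers that need to tell the cases apart have to check the inputs.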
- union(box1: deepchem.utils.coordinate_box_utils.CoordinateBox, box2: deepchem.utils.coordinate_box_utils.CoordinateBox) deepchem.utils.coordinate_box_utils.CoordinateBox [source]¶
Merges provided boxes to find the smallest union box.
This method merges the two provided boxes.
- Parameters
box1 (CoordinateBox) – First box to merge in
box2 (CoordinateBox) – Second box to merge into this box
- Returns
Smallest CoordinateBox that contains both box1 and box2
- Return type
CoordinateBox
- merge_overlapping_boxes(boxes: List[deepchem.utils.coordinate_box_utils.CoordinateBox], threshold: float = 0.8) List[deepchem.utils.coordinate_box_utils.CoordinateBox] [source]¶
Merge boxes which have an overlap greater than threshold.
- Parameters
boxes (list[CoordinateBox]) – A list of CoordinateBox objects.
threshold (float, default 0.8) – The volume fraction of the boxes that must overlap for them to be merged together.
- Returns
List[CoordinateBox] of merged boxes. This list will have length less than or equal to the length of boxes.
- Return type
List[CoordinateBox]
- get_face_boxes(coords: numpy.ndarray, pad: float = 5.0) List[deepchem.utils.coordinate_box_utils.CoordinateBox] [source]¶
For each face of the convex hull, compute a coordinate box around it.
The convex hull of a macromolecule will have a series of triangular faces. For each such triangular face, we construct a bounding box around this triangle. Think of this box as attempting to capture some binding interaction region whose exterior is controlled by the box. Note that this box will likely be a crude approximation, but the advantage of this technique is that it only uses simple geometry to provide some basic biological insight into the molecule at hand.
The pad parameter is used to control the amount of padding around the face to be used for the coordinate box.
- Parameters
coords (np.ndarray) – A numpy array of shape (N, 3). The coordinates of a molecule.
pad (float, optional (default 5.0)) – The number of angstroms to pad.
- Returns
boxes – List of CoordinateBox
- Return type
List[CoordinateBox]
Examples
>>> coords = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
>>> boxes = get_face_boxes(coords, pad=5)
Evaluation Utils
- class Evaluator(model, dataset: deepchem.data.datasets.Dataset, transformers: List[transformers.Transformer])[source]¶
Class that evaluates a model on a given dataset.
The evaluator class is used to evaluate a dc.models.Model class on a given dc.data.Dataset object. The evaluator is aware of dc.trans.Transformer objects so will automatically undo any transformations which have been applied.
Examples
Evaluators allow a model to be evaluated directly with an sklearn metric function. Let’s do a bit of setup constructing our dataset and model.
>>> import deepchem as dc
>>> import numpy as np
>>> X = np.random.rand(10, 5)
>>> y = np.random.rand(10, 1)
>>> dataset = dc.data.NumpyDataset(X, y)
>>> model = dc.models.MultitaskRegressor(1, 5)
>>> transformers = []
Then you can evaluate this model as follows:
>>> import sklearn
>>> evaluator = Evaluator(model, dataset, transformers)
>>> multitask_scores = evaluator.compute_model_performance(
...     sklearn.metrics.mean_absolute_error)
Evaluators can also be used with dc.metrics.Metric objects in case you want to customize your metric further.
>>> evaluator = Evaluator(model, dataset, transformers)
>>> metric = dc.metrics.Metric(dc.metrics.mae_score)
>>> multitask_scores = evaluator.compute_model_performance(metric)
- __init__(model, dataset: deepchem.data.datasets.Dataset, transformers: List[transformers.Transformer])[source]¶
Initialize this evaluator
- Parameters
model (Model) – Model to evaluate. Note that this must be a regression or classification model and not a generative model.
dataset (Dataset) – Dataset object to evaluate model on.
transformers (List[Transformer]) – List of dc.trans.Transformer objects. These transformations must have been applied to dataset previously. The dataset will be untransformed for metric evaluation.
- output_statistics(scores: Dict[str, float], stats_out: str)[source]¶
Write computed stats to file.
- Parameters
scores (dict) – Dictionary mapping names of metrics to scores.
stats_out (str) – Name of file to write scores to.
- output_predictions(y_preds: numpy.ndarray, csv_out: str)[source]¶
Writes predictions to file.
Writes predictions made on self.dataset to a specified file on disk. self.dataset.ids are used to format predictions.
- Parameters
y_preds (np.ndarray) – Predictions to output
csv_out (str) – Name of file to write predictions to.
- compute_model_performance(metrics: Union[deepchem.metrics.metric.Metric, Callable[[...], Any], List[deepchem.metrics.metric.Metric], List[Callable[[...], Any]]], csv_out: Optional[str] = None, stats_out: Optional[str] = None, per_task_metrics: bool = False, use_sample_weights: bool = False, n_classes: int = 2) Union[Dict[str, float], Tuple[Dict[str, float], Dict[str, float]]] [source]¶
Computes statistics of model on test data and saves results to csv.
- Parameters
metrics (dc.metrics.Metric/list[dc.metrics.Metric]/function) – The set of metrics provided. This class attempts to do some intelligent handling of input. If a single dc.metrics.Metric object is provided or a list is provided, it will evaluate self.model on these metrics. If a function is provided, it is assumed to be a metric function that this method will attempt to wrap in a dc.metrics.Metric object. A metric function must accept two arguments, y_true, y_pred both of which are np.ndarray objects and return a floating point score. The metric function may also accept a keyword argument sample_weight to account for per-sample weights.
csv_out (str, optional (DEPRECATED)) – Filename to write CSV of model predictions.
stats_out (str, optional (DEPRECATED)) – Filename to write computed statistics.
per_task_metrics (bool, optional) – If true, return computed metric for each task on multitask dataset.
use_sample_weights (bool, optional (default False)) – If set, use per-sample weights w.
n_classes (int, optional (default 2)) – If specified, will use n_classes as the number of unique classes in self.dataset. Note that this argument will be ignored for regression metrics.
- Returns
multitask_scores (dict) – Dictionary mapping names of metrics to metric scores.
all_task_scores (dict, optional) – If per_task_metrics == True, then returns a second dictionary of scores for each task separately.
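A metric function of the shape described for the metrics parameter (two array arguments, an optional sample_weight keyword, and a float result) might look like the sketch below. The name weighted_mae is hypothetical and the example assumes one-dimensional inputs; it shows only the function contract, not how the evaluator calls it.

```python
import numpy as np


def weighted_mae(y_true, y_pred, sample_weight=None):
    """Mean absolute error, optionally weighted per sample."""
    errors = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    # np.average falls back to the plain mean when weights is None.
    return float(np.average(errors, weights=sample_weight))


score = weighted_mae(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 5.0]))
weighted = weighted_mae(np.array([1.0, 3.0]), np.array([2.0, 3.0]),
                        sample_weight=np.array([3.0, 1.0]))
```

In the second call the first sample carries three times the weight of the second, so its error dominates the score.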
- class GeneratorEvaluator(model, generator: Iterable[Tuple[Any, Any, Any]], transformers: List[transformers.Transformer], labels: Optional[List] = None, weights: Optional[List] = None)[source]¶
Evaluate models on a stream of data.
This class is a partner class to Evaluator. Instead of operating over datasets this class operates over a generator which yields batches of data to feed into provided model.
Examples
>>> import deepchem as dc
>>> import numpy as np
>>> X = np.random.rand(10, 5)
>>> y = np.random.rand(10, 1)
>>> dataset = dc.data.NumpyDataset(X, y)
>>> model = dc.models.MultitaskRegressor(1, 5)
>>> generator = model.default_generator(dataset, pad_batches=False)
>>> transformers = []
Then you can evaluate this model as follows:
>>> import sklearn
>>> evaluator = GeneratorEvaluator(model, generator, transformers)
>>> multitask_scores = evaluator.compute_model_performance(
...     sklearn.metrics.mean_absolute_error)
Evaluators can also be used with dc.metrics.Metric objects in case you want to customize your metric further. (Note that a given generator can only be used once, so we have to redefine the generator here.)
>>> generator = model.default_generator(dataset, pad_batches=False)
>>> evaluator = GeneratorEvaluator(model, generator, transformers)
>>> metric = dc.metrics.Metric(dc.metrics.mae_score)
>>> multitask_scores = evaluator.compute_model_performance(metric)
- __init__(model, generator: Iterable[Tuple[Any, Any, Any]], transformers: List[transformers.Transformer], labels: Optional[List] = None, weights: Optional[List] = None)[source]¶
- Parameters
model (Model) – Model to evaluate.
generator (generator) – Generator which yields batches to feed into the model. For a KerasModel, it should be a tuple of the form (inputs, labels, weights). The “correct” way to create this generator is to use model.default_generator as shown in the example above.
transformers (List[Transformer]) – Transformers to “undo” when applied to the model’s outputs
labels (list of Layer) – layers which are keys in the generator to compare to outputs
weights (list of Layer) – layers which are keys in the generator for weight matrices
- compute_model_performance(metrics: Union[deepchem.metrics.metric.Metric, Callable[[...], Any], List[deepchem.metrics.metric.Metric], List[Callable[[...], Any]]], per_task_metrics: bool = False, use_sample_weights: bool = False, n_classes: int = 2) Union[Dict[str, float], Tuple[Dict[str, float], Dict[str, float]]] [source]¶
Computes statistics of model on test data and saves results to csv.
- Parameters
metrics (dc.metrics.Metric/list[dc.metrics.Metric]/function) – The set of metrics provided. This class attempts to do some intelligent handling of input. If a single dc.metrics.Metric object is provided or a list is provided, it will evaluate self.model on these metrics. If a function is provided, it is assumed to be a metric function that this method will attempt to wrap in a dc.metrics.Metric object. A metric function must accept two arguments, y_true, y_pred both of which are np.ndarray objects and return a floating point score.
per_task_metrics (bool, optional) – If true, return computed metric for each task on multitask dataset.
use_sample_weights (bool, optional (default False)) – If set, use per-sample weights w.
n_classes (int, optional (default 2)) – If specified, will assume that all metrics are classification metrics and will use n_classes as the number of unique classes in self.dataset.
- Returns
multitask_scores (dict) – Dictionary mapping names of metrics to metric scores.
all_task_scores (dict, optional) – If per_task_metrics == True, then returns a second dictionary of scores for each task separately.
- relative_difference(x: numpy.ndarray, y: numpy.ndarray) numpy.ndarray [source]¶
Compute the relative difference between x and y
The two argument arrays must have the same shape.
- Parameters
x (np.ndarray) – First input array
y (np.ndarray) – Second input array
- Returns
z – We will have z == np.abs(x-y) / np.abs(max(x, y)).
- Return type
np.ndarray
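The formula in the Returns description can be written out in NumPy as follows. This sketch follows the formula as stated, reading max as the element-wise maximum; that reading, and the name relative_difference_sketch, are assumptions, and the library's own implementation may compute the quantity differently.

```python
import numpy as np


def relative_difference_sketch(x, y):
    """Element-wise |x - y| / |max(x, y)| for two same-shaped arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("x and y must have the same shape")
    # Assumption: `max` in the formula means the element-wise maximum.
    return np.abs(x - y) / np.abs(np.maximum(x, y))


z = relative_difference_sketch([1.0, 4.0], [2.0, 4.0])
# z is [0.5, 0.0]
```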
Genomic Utilities
- seq_one_hot_encode(sequences, letters: str = 'ATCGN') numpy.ndarray [source]¶
One hot encodes list of genomic sequences.
Sequences encoded have shape (N_sequences, N_letters, sequence_length, 1). These sequences will be processed as images with one color channel.
- Parameters
sequences (np.ndarray or Iterator[Bio.SeqRecord]) – Iterable object of genetic sequences
letters (str, optional (default "ATCGN")) – String with the set of possible letters in the sequences.
- Raises
ValueError – If sequences are of different lengths.
- Returns
A numpy array of shape (N_sequences, N_letters, sequence_length, 1).
- Return type
np.ndarray
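The output layout (N_sequences, N_letters, sequence_length, 1) can be made concrete with a short NumPy sketch that accepts plain strings. The name one_hot_sequences is hypothetical; the row order is assumed to follow the order of characters in letters, and a character outside letters raises a KeyError here.

```python
import numpy as np


def one_hot_sequences(sequences, letters="ATCGN"):
    """Encode equal-length strings as (N_sequences, N_letters, length, 1)."""
    sequences = [str(seq) for seq in sequences]
    if len({len(seq) for seq in sequences}) > 1:
        raise ValueError("all sequences must have the same length")
    index = {letter: row for row, letter in enumerate(letters)}
    length = len(sequences[0]) if sequences else 0
    encoded = np.zeros((len(sequences), len(letters), length, 1))
    for n, seq in enumerate(sequences):
        for pos, letter in enumerate(seq):
            # Row = letter's position in `letters`; column = position in seq.
            encoded[n, index[letter], pos, 0] = 1.0
    return encoded


encoded = one_hot_sequences(["ACGT", "NNAT"])
# encoded.shape == (2, 5, 4, 1)
```

Each sequence becomes a one-channel "image" whose rows are letters and whose columns are sequence positions, with exactly one 1 per column.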
- encode_bio_sequence(fname: str, file_type: str = 'fasta', letters: str = 'ATCGN') numpy.ndarray [source]¶
Loads a sequence file and returns an array of one-hot sequences.
- Parameters
fname (str) – Filename of fasta file.
file_type (str, optional (default "fasta")) – The type of file encoding to process, e.g. fasta or fastq. This is passed to Biopython.SeqIO.parse.
letters (str, optional (default "ATCGN")) – The set of letters that the sequences consist of, e.g. ATCG.
- Returns
A numpy array of shape (N_sequences, N_letters, sequence_length, 1).
- Return type
np.ndarray
Notes
This function requires BioPython to be installed.
- hhblits(dataset_path, database=None, data_dir=None, evalue=0.001, num_iterations=2, num_threads=4)[source]¶
Run hhblits multisequence alignment search on a dataset. This function requires the hhblits binary to be installed and in the path. This function also requires a Hidden Markov Model reference database to be provided. Both can be found here: https://github.com/soedinglab/hh-suite
The database should be in the deepchem data directory or specified as an argument. To set the deepchem data directory, run this command in your environment:
export DEEPCHEM_DATA_DIR=<path to data directory>
- Parameters
dataset_path (str) – Path to single sequence or multiple sequence alignment (MSA) dataset. Results will be saved in this directory.
database (str) – Name of database to search against. Note this is not the path, but the name of the database.
data_dir (str) – Path to database directory.
evalue (float) – E-value cutoff.
num_iterations (int) – Number of iterations.
num_threads (int) – Number of threads.
- Returns
results (.a3m file) – MSA file containing the results of the hhblits search.
results (.hhr file) – hhsuite results file containing the results of the hhblits search.
Examples
>>> from deepchem.utils.sequence_utils import hhblits
>>> msa_path = hhblits('test/data/example.fasta', database='example_db', data_dir='test/data/', evalue=0.001, num_iterations=2, num_threads=4)
- hhsearch(dataset_path, database=None, data_dir=None, evalue=0.001, num_iterations=2, num_threads=4)[source]¶
Run hhsearch multisequence alignment search on a dataset. This function requires the hhsearch binary to be installed and in the path. This function also requires a Hidden Markov Model reference database to be provided. Both can be found here: https://github.com/soedinglab/hh-suite
The database should be in the deepchem data directory or specified as an argument. To set the deepchem data directory, run this command in your environment:
export DEEPCHEM_DATA_DIR=<path to data directory>
Examples
>>> from deepchem.utils.sequence_utils import hhsearch
>>> msa_path = hhsearch('test/data/example.fasta', database='example_db', data_dir='test/data/', evalue=0.001, num_iterations=2, num_threads=4)
- Parameters
dataset_path (str) – Path to multiple sequence alignment dataset. Results will be saved in this directory.
database (str) – Name of database to search against. Note this is not the path, but the name of the database.
data_dir (str) – Path to database directory.
evalue (float) – E-value cutoff.
num_iterations (int) – Number of iterations.
num_threads (int) – Number of threads.
- Returns
results (.a3m file) – MSA file containing the results of the hhsearch run.
results (.hhr file) – hhsuite results file containing the results of the hhsearch run.
Geometry Utilities¶
- unit_vector(vector: numpy.ndarray) numpy.ndarray [source]¶
Returns the unit vector of the vector.
- Parameters
vector (np.ndarray) – A numpy array of shape (3,), where 3 is (x,y,z).
- Returns
A numpy array of shape (3,). The unit vector of the input vector.
- Return type
np.ndarray
- angle_between(vector_i: numpy.ndarray, vector_j: numpy.ndarray) float [source]¶
Returns the angle in radians between vectors “vector_i” and “vector_j”
Note that this function always returns the smaller of the two angles between the vectors (value between 0 and pi).
- Parameters
vector_i (np.ndarray) – A numpy array of shape (3,), where 3 is (x,y,z).
vector_j (np.ndarray) – A numpy array of shape (3,), where 3 is (x,y,z).
- Returns
The angle in radians between the two vectors.
- Return type
float
Examples
>>> print("%0.06f" % angle_between((1, 0, 0), (0, 1, 0)))
1.570796
>>> print("%0.06f" % angle_between((1, 0, 0), (1, 0, 0)))
0.000000
>>> print("%0.06f" % angle_between((1, 0, 0), (-1, 0, 0)))
3.141593
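For readers who want to see how such an angle can be computed, here is a minimal NumPy sketch. The helper name angle_between_sketch is hypothetical; normalizing both vectors and clipping the cosine before arccos is one common approach and is not claimed to be DeepChem's exact implementation.

```python
import numpy as np


def angle_between_sketch(vector_i, vector_j):
    """Hypothetical helper: angle in radians, in the range [0, pi]."""
    u = np.asarray(vector_i, dtype=float)
    v = np.asarray(vector_j, dtype=float)
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    # Clip to guard against round-off pushing the cosine outside [-1, 1].
    cosine = np.clip(np.dot(u, v), -1.0, 1.0)
    return float(np.arccos(cosine))


right_angle = angle_between_sketch((1, 0, 0), (0, 1, 0))
opposite = angle_between_sketch((1, 0, 0), (-1, 0, 0))
```

Because arccos returns values in [0, pi], this construction always yields the smaller of the two angles between the vectors.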
- generate_random_unit_vector() numpy.ndarray [source]¶
Generate a random unit vector on the sphere S^2.
Citation: http://mathworld.wolfram.com/SpherePointPicking.html
- Pseudocode:
Choose random theta element [0, 2*pi]
Choose random z element [-1, 1]
Compute output vector u: (x,y,z) = (sqrt(1-z^2)*cos(theta), sqrt(1-z^2)*sin(theta),z)
- Returns
u – A numpy array of shape (3,). u is a unit vector.
- Return type
np.ndarray
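The pseudocode above translates almost line for line into NumPy. The following is a sketch under stated assumptions: the helper name random_unit_vector_sketch and the explicit rng argument are ours, added so the example is reproducible.

```python
import numpy as np


def random_unit_vector_sketch(rng):
    """Hypothetical helper: sample a point uniformly on the sphere S^2."""
    theta = rng.uniform(0.0, 2.0 * np.pi)  # azimuthal angle
    z = rng.uniform(-1.0, 1.0)             # height along the z axis
    r = np.sqrt(1.0 - z * z)               # radius of the circle at height z
    return np.array([r * np.cos(theta), r * np.sin(theta), z])


rng = np.random.default_rng(0)
u = random_unit_vector_sketch(rng)
```

Sampling z uniformly (rather than the polar angle) is what makes the points uniform on the sphere, as described in the cited MathWorld article.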
- generate_random_rotation_matrix() numpy.ndarray [source]¶
Generates a random rotation matrix.
Generate a random unit vector u, randomly sampled from the unit sphere (see function generate_random_unit_vector() for details)
Generate a second random unit vector v
If absolute value of u dot v > 0.99, repeat.
(This is important for numerical stability. Intuition: we want them to be as linearly independent as possible or else the orthogonalized version of v will be much shorter in magnitude compared to u. I assume in Stack they took this from Gram-Schmidt orthogonalization?)
v" = v - (u dot v)*u, i.e. subtract out the component of v that's in u's direction
normalize v" (this isn't in Stack but I assume it must be done)
find w = u cross v"
u, v", and w will form the columns of a rotation matrix, R.
The intuition is that u, v" and w are, respectively, what the standard basis vectors e1, e2, and e3 will be mapped to under the transformation.
- Returns
R – A numpy array of shape (3, 3). R is a rotation matrix.
- Return type
np.ndarray
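The steps above amount to one round of Gram-Schmidt orthogonalization followed by a cross product. A minimal sketch follows. The helper names are hypothetical, and for brevity the unit vectors are drawn by normalizing Gaussian samples, which is a different (but also uniform) sampling method from the one generate_random_unit_vector describes.

```python
import numpy as np


def _unit_vector_sketch(rng):
    # Assumption: a normalized Gaussian sample stands in for the sampler above.
    v = rng.normal(size=3)
    return v / np.linalg.norm(v)


def random_rotation_matrix_sketch(rng):
    """Hypothetical helper: build a rotation matrix with columns u, v", w."""
    u = _unit_vector_sketch(rng)
    v = _unit_vector_sketch(rng)
    # Resample v while it is nearly parallel to u (numerical stability).
    while abs(np.dot(u, v)) > 0.99:
        v = _unit_vector_sketch(rng)
    v_perp = v - np.dot(u, v) * u              # remove the component along u
    v_perp = v_perp / np.linalg.norm(v_perp)   # normalize
    w = np.cross(u, v_perp)                    # completes a right-handed basis
    return np.column_stack([u, v_perp, w])


R = random_rotation_matrix_sketch(np.random.default_rng(0))
```

Since the columns are orthonormal and w = u cross v", R satisfies R^T R = I with determinant +1, so it is a proper rotation rather than a reflection.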
- is_angle_within_cutoff(vector_i: numpy.ndarray, vector_j: numpy.ndarray, angle_cutoff: float) bool [source]¶
A utility function to compute whether two vectors are within a cutoff from 180 degrees apart.
- Parameters
vector_i (np.ndarray) – A numpy array of shape (3,), where 3 is (x,y,z).
vector_j (np.ndarray) – A numpy array of shape (3,), where 3 is (x,y,z).
angle_cutoff (float) – The deviation from 180 (in degrees)
- Returns
Whether two vectors are within a cutoff from 180 degrees apart
- Return type
bool
Graph Utilities¶
- fourier_encode_dist(x, num_encodings=4, include_self=True)[source]¶
Fourier encode the input tensor x based on the specified number of encodings.
This function applies a Fourier encoding to the input tensor x by dividing it by a range of scales (2^i for i in range(num_encodings)) and then concatenating the sine and cosine of the scaled values. Optionally, the original input tensor can be included in the output.
- Parameters
x (torch.Tensor) – Input tensor to be Fourier encoded.
num_encodings (int, optional, default=4) – Number of Fourier encodings to apply.
include_self (bool, optional, default=True) – Whether to include the original input tensor in the output.
- Returns
Fourier encoded tensor.
- Return type
torch.Tensor
Examples
>>> import torch
>>> x = torch.tensor([1.0, 2.0, 3.0])
>>> encoded_x = fourier_encode_dist(x, num_encodings=4, include_self=True)
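The encoding described above can be illustrated without PyTorch. The NumPy sketch below is an analogue, not the library function: the helper name fourier_encode_sketch is hypothetical, and the output layout (sines, then cosines, then the original value along a new trailing axis) is an assumption.

```python
import numpy as np


def fourier_encode_sketch(x, num_encodings=4, include_self=True):
    """Hypothetical NumPy analogue of the Fourier encoding described above."""
    x = np.asarray(x, dtype=float)[..., np.newaxis]  # shape (..., 1)
    scales = 2.0 ** np.arange(num_encodings)         # 1, 2, 4, ...
    scaled = x / scales                              # shape (..., num_encodings)
    parts = [np.sin(scaled), np.cos(scaled)]
    if include_self:
        parts.append(x)                              # keep the raw value
    # Assumed layout: [sin..., cos..., self] along the last axis.
    return np.concatenate(parts, axis=-1)


encoded = fourier_encode_sketch([1.0, 2.0, 3.0])
```

With num_encodings=4 and include_self=True, each input value expands to 2 * 4 + 1 = 9 features in this sketch.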
- aggregate_mean(h, **kwargs)[source]¶
Compute the mean of the input tensor along the second to last dimension.
- Parameters
h (torch.Tensor) – Input tensor.
- Returns
Mean of the input tensor along the second to last dimension.
- Return type
torch.Tensor
- aggregate_max(h, **kwargs)[source]¶
Compute the max of the input tensor along the second to last dimension.
- Parameters
h (torch.Tensor) – Input tensor.
- Returns
Max of the input tensor along the second to last dimension.
- Return type
torch.Tensor
- aggregate_min(h, **kwargs)[source]¶
Compute the min of the input tensor along the second to last dimension.
- Parameters
h (torch.Tensor) – Input tensor.
**kwargs – Additional keyword arguments.
- Returns
Min of the input tensor along the second to last dimension.
- Return type
torch.Tensor
- aggregate_std(h, **kwargs)[source]¶
Compute the standard deviation of the input tensor along the second to last dimension.
- Parameters
h (torch.Tensor) – Input tensor.
- Returns
Standard deviation of the input tensor along the second to last dimension.
- Return type
torch.Tensor
- aggregate_var(h, **kwargs)[source]¶
Compute the variance of the input tensor along the second to last dimension.
- Parameters
h (torch.Tensor) – Input tensor.
- Returns
Variance of the input tensor along the second to last dimension.
- Return type
torch.Tensor
- aggregate_moment(h, n=3, **kwargs)[source]¶
Compute the nth moment of the input tensor along the second to last dimension.
- Parameters
h (torch.Tensor) – Input tensor.
n (int, optional, default=3) – The order of the moment to compute.
- Returns
Nth moment of the input tensor along the second to last dimension.
- Return type
torch.Tensor
- aggregate_sum(h, **kwargs)[source]¶
Compute the sum of the input tensor along the second to last dimension.
- Parameters
h (torch.Tensor) – Input tensor.
- Returns
Sum of the input tensor along the second to last dimension.
- Return type
torch.Tensor
- scale_identity(h, D=None, avg_d=None)[source]¶
Identity scaling function.
- Parameters
h (torch.Tensor) – Input tensor.
D (torch.Tensor, optional) – Degree tensor.
avg_d (dict, optional) – Dictionary containing averages over the training set.
- Returns
Scaled input tensor.
- Return type
torch.Tensor
- scale_amplification(h, D, avg_d)[source]¶
Amplification scaling function: log(D + 1) / d * h, where d is the average of log(D + 1) in the training set.
- Parameters
h (torch.Tensor) – Input tensor.
D (torch.Tensor) – Degree tensor.
avg_d (dict) – Dictionary containing averages over the training set.
- Returns
Scaled input tensor.
- Return type
torch.Tensor
- scale_attenuation(h, D, avg_d)[source]¶
Attenuation scaling function: (log(D + 1))^-1 / d * h, where d is the average of (log(D + 1))^-1 in the training set.
- Parameters
h (torch.Tensor) – Input tensor.
D (torch.Tensor) – Degree tensor.
avg_d (dict) – Dictionary containing averages over the training set.
- Returns
Scaled input tensor.
- Return type
torch.Tensor
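The amplification and attenuation formulas can be sketched in NumPy as written in the descriptions above. This is an illustration only: the helper names are hypothetical, and the training-set averages are passed as plain floats because the keys of the avg_d dictionary are not documented here.

```python
import numpy as np


def amplification_sketch(h, D, avg_log_degree):
    """Hypothetical helper: log(D + 1) / d * h with d = mean of log(D + 1)."""
    return h * (np.log(D + 1.0) / avg_log_degree)


def attenuation_sketch(h, D, avg_inv_log_degree):
    """Hypothetical helper: (log(D + 1))^-1 / d * h, d = mean of the inverse logs."""
    return h * ((1.0 / np.log(D + 1.0)) / avg_inv_log_degree)


# Toy "training set" of node degrees (illustrative values only).
train_degrees = np.array([1.0, 2.0, 3.0])
avg_log = np.mean(np.log(train_degrees + 1.0))
avg_inv_log = np.mean(1.0 / np.log(train_degrees + 1.0))

h = np.ones((3, 2))                  # 3 nodes, 2 features each
D = train_degrees.reshape(-1, 1)     # degree of each node, broadcast over features
amplified = amplification_sketch(h, D, avg_log)
attenuated = attenuation_sketch(h, D, avg_inv_log)
```

Dividing by the training-set average keeps the mean scaling factor at 1, so high-degree nodes are boosted by amplification and damped by attenuation relative to a typical node.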
Hash Function Utilities¶
- hash_ecfp(ecfp: str, size: int = 1024) int [source]¶
Returns an int < size representing given ECFP fragment.
Input must be a string. This utility function is used for various ECFP based fingerprints.
- Parameters
ecfp (str) – String to hash. Usually an ECFP fragment.
size (int, optional (default 1024)) – Hash to an int in range [0, size)
- Returns
ecfp_hash – An int < size representing given ECFP fragment
- Return type
int
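As an illustration of hashing a fragment string into the range [0, size), here is a minimal sketch. The helper name hash_string_sketch is hypothetical, and the use of MD5 followed by a modulo is an assumption chosen for a deterministic example; it is not presented as the hash DeepChem uses.

```python
import hashlib


def hash_string_sketch(text, size=1024):
    """Hypothetical helper: map a string to an int in [0, size)."""
    # Assumption: MD5 is used only for a stable, platform-independent digest.
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return int(digest, 16) % size


bucket = hash_string_sketch("C(=O)N")
```

A cryptographic digest is used here instead of Python's built-in hash() because the built-in is randomized per process for strings, which would make fingerprints irreproducible across runs.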
- hash_ecfp_pair(ecfp_pair: Tuple[str, str], size: int = 1024) int [source]¶
Returns an int < size representing that ECFP pair.
Input must be a tuple of strings. This utility is primarily used for spatial contact featurizers. For example, if a protein and ligand have close contact region, the first string could be the protein’s fragment and the second the ligand’s fragment. The pair could be hashed together to achieve one hash value for this contact region.
- Parameters
ecfp_pair (Tuple[str, str]) – Pair of ECFP fragment strings
size (int, optional (default 1024)) – Hash to an int in range [0, size)
- Returns
ecfp_hash – An int < size representing given ECFP pair.
- Return type
int
- vectorize(hash_function: Callable[[Any, int], int], feature_dict: Optional[Dict[int, str]] = None, size: int = 1024, feature_list: Optional[List] = None) numpy.ndarray [source]¶
Helper function to vectorize a spatial description from a hash.
Hash functions are used to perform spatial featurizations in DeepChem. However, it’s necessary to convert backwards from the hash function to feature vectors. This function aids in this conversion procedure. It creates a vector of zeros of length size. It then loops through feature_dict, uses hash_function to hash the stored value to an integer in range [0, size) and bumps that index.
- Parameters
hash_function (Function, Callable[[str, int], int]) – Should accept two arguments, feature, and size and return a hashed integer. Here feature is the item to hash, and size is an int. For example, if size=1024, then hashed values must fall in range [0, 1024).
feature_dict (Dict, optional (default None)) – Maps unique keys to features computed.
size (int (default 1024)) – Length of generated bit vector
feature_list (List, optional (default None)) – List of features.
- Returns
feature_vector – A numpy array of shape (size,)
- Return type
np.ndarray
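The feature_dict path described above (zero vector, hash each stored value, bump that index) can be sketched as follows. Both toy_hash and vectorize_sketch are hypothetical names, and counting repeated hits cumulatively is an assumption about what "bumps that index" means.

```python
import hashlib

import numpy as np


def toy_hash(feature, size):
    """Hypothetical hash function with the (feature, size) -> int signature."""
    digest = hashlib.md5(str(feature).encode("utf-8")).hexdigest()
    return int(digest, 16) % size


def vectorize_sketch(hash_function, feature_dict, size=1024):
    """Hypothetical helper: count hashed features into a vector of length size."""
    vector = np.zeros(size)
    for feature in feature_dict.values():
        # Assumption: every occurrence increments its bucket by one.
        vector[hash_function(feature, size)] += 1.0
    return vector


features = {0: "frag_a", 1: "frag_b", 2: "frag_a"}
vector = vectorize_sketch(toy_hash, features, size=16)
```

Distinct features can collide in the same bucket, so a small size trades memory for resolution.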
Voxel Utils¶
- convert_atom_to_voxel(coordinates: numpy.ndarray, atom_index: int, box_width: float, voxel_width: float) numpy.ndarray [source]¶
Converts atom coordinates to an i,j,k grid index.
This function offsets molecular atom coordinates by (box_width/2, box_width/2, box_width/2) and then divides by voxel_width to compute the voxel indices.
- Parameters
coordinates (np.ndarray) – Array with coordinates of all atoms in the molecule, shape (N, 3).
atom_index (int) – Index of an atom in the molecule.
box_width (float) – Size of the box in Angstroms.
voxel_width (float) – Size of a voxel in Angstroms
- Returns
indices – A 1D numpy array of length 3 with [i, j, k], the voxel coordinates of specified atom.
- Return type
np.ndarray
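The offset-and-divide computation described above can be sketched in a few lines. The helper name atom_to_voxel_sketch is hypothetical, and rounding down with floor is an assumption about how fractional positions are assigned to voxels.

```python
import numpy as np


def atom_to_voxel_sketch(coordinates, atom_index, box_width, voxel_width):
    """Hypothetical helper: map one atom's (x, y, z) to [i, j, k]."""
    # Shift so that the box spans [0, box_width) along each axis.
    shifted = coordinates[atom_index] + box_width / 2.0
    # Assumption: positions are rounded down to the containing voxel.
    return np.floor(shifted / voxel_width).astype(int)


coords = np.array([[0.0, 0.0, 0.0],
                   [1.5, -2.2, 7.9]])
origin_voxel = atom_to_voxel_sketch(coords, 0, box_width=16.0, voxel_width=1.0)
other_voxel = atom_to_voxel_sketch(coords, 1, box_width=16.0, voxel_width=1.0)
```

With a 16 Angstrom box and 1 Angstrom voxels, an atom at the origin lands in the central voxel [8, 8, 8]. Atoms farther than box_width / 2 from the origin along any axis produce indices outside the grid and need to be handled by the caller.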
- convert_atom_pair_to_voxel(coordinates_tuple: Tuple[numpy.ndarray, numpy.ndarray], atom_index_pair: Tuple[int, int], box_width: float, voxel_width: float) numpy.ndarray [source]¶
Converts a pair of atoms to i,j,k grid indexes.
- Parameters
coordinates_tuple (Tuple[np.ndarray, np.ndarray]) – A tuple containing two molecular coordinate arrays of shapes (N, 3) and (M, 3).
atom_index_pair (Tuple[int, int]) – A tuple of indices for the atoms in the two molecules.
box_width (float) – Size of the box in Angstroms.
voxel_width (float) – Size of a voxel in Angstroms
- Returns
indices_list – A numpy array of shape (2, 3), where 3 is [i, j, k] of the voxel coordinates of specified atom.
- Return type
np.ndarray
- voxelize(get_voxels: Callable[[...], Any], coordinates: Any, box_width: float = 16.0, voxel_width: float = 1.0, hash_function: Optional[Callable[[...], Any]] = None, feature_dict: Optional[Dict[Any, Any]] = None, feature_list: Optional[List[Union[int, Tuple[int]]]] = None, nb_channel: int = 16, dtype: str = 'int') numpy.ndarray [source]¶
Helper function to voxelize inputs.
This helper function helps convert a hash function which specifies spatial features of a molecular complex into a voxel tensor. This utility is used by various featurizers that generate voxel grids.
- Parameters
get_voxels (Function) – Function that voxelizes inputs
coordinates (Any) – Contains the 3D coordinates of a molecular system. This should have whatever type get_voxels() expects as its first argument.
box_width (float, optional (default 16.0)) – Size of a box in which voxel features are calculated. Box is centered on a ligand centroid.
voxel_width (float, optional (default 1.0)) – Size of a 3D voxel in a grid in Angstroms.
hash_function (Function) – Used to map feature choices to voxel channels.
feature_dict (Dict, optional (default None)) – Keys are atom indices or tuples of atom indices, the values are computed features. If hash_function is not None, then the values are hashed using the hash function into [0, nb_channel) and this channel at the voxel for the given key is incremented by 1 for each dictionary entry. If hash_function is None, then the value must be a vector of size (nb_channel,) which is added to the existing channel values at that voxel grid.
feature_list (List, optional (default None)) – List of atom indices or tuples of atom indices. This can only be used if nb_channel==1. Increments the voxels corresponding to these indices by 1 for each entry.
nb_channel (int, optional (default 16)) – The number of feature channels computed per voxel. Should be a power of 2.
dtype (str ('int' or 'float'), optional (default 'int')) – The type of the numpy ndarray created to hold features.
- Returns
feature_tensor – The voxel of the input with the shape (voxels_per_edge, voxels_per_edge, voxels_per_edge, nb_channel).
- Return type
np.ndarray
Graph Convolution Utilities¶
- one_hot_encode(val: Union[int, str], allowable_set: Union[List[str], List[int]], include_unknown_set: bool = False) List[float] [source]¶
One hot encoder for elements of a provided set.
Examples
>>> one_hot_encode("a", ["a", "b", "c"])
[1.0, 0.0, 0.0]
>>> one_hot_encode(2, [0, 1, 2])
[0.0, 0.0, 1.0]
>>> one_hot_encode(3, [0, 1, 2])
[0.0, 0.0, 0.0]
>>> one_hot_encode(3, [0, 1, 2], True)
[0.0, 0.0, 0.0, 1.0]
- Parameters
val (int or str) – The value must be present in allowable_set.
allowable_set (List[int] or List[str]) – List of allowable quantities.
include_unknown_set (bool, default False) – If true, the index of all values not in allowable_set is len(allowable_set).
- Returns
A one-hot vector of val. If include_unknown_set is False, the length is len(allowable_set). If include_unknown_set is True, the length is len(allowable_set) + 1.
- Return type
List[float]
- Raises
ValueError – If include_unknown_set is False and val is not in allowable_set.
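A pure-Python sketch that reproduces the four examples shown for one_hot_encode is given below. The helper name one_hot_sketch is hypothetical. It follows the Examples (an unknown value yields an all-zero vector when include_unknown_set is False) rather than raising an error, which is an assumption about which of the two documented behaviours applies.

```python
def one_hot_sketch(val, allowable_set, include_unknown_set=False):
    """Hypothetical helper: one-hot encode val against allowable_set."""
    length = len(allowable_set) + (1 if include_unknown_set else 0)
    encoding = [0.0] * length
    if val in allowable_set:
        encoding[allowable_set.index(val)] = 1.0
    elif include_unknown_set:
        # Unknown values share the extra slot at index len(allowable_set).
        encoding[-1] = 1.0
    return encoding


known = one_hot_sketch("a", ["a", "b", "c"])
unknown = one_hot_sketch(3, [0, 1, 2], include_unknown_set=True)
```

The atom and bond featurizers that follow in this section share the same allowable_set and include_unknown_set convention.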
- get_atom_type_one_hot(atom: Any, allowable_set: List[str] = ['C', 'N', 'O', 'F', 'P', 'S', 'Cl', 'Br', 'I'], include_unknown_set: bool = True) List[float] [source]¶
Get a one-hot feature of an atom type.
- Parameters
atom (rdkit.Chem.rdchem.Atom) – RDKit atom object
allowable_set (List[str]) – The atom types to consider. The default set is [“C”, “N”, “O”, “F”, “P”, “S”, “Cl”, “Br”, “I”].
include_unknown_set (bool, default True) – If true, the index of all atoms not in allowable_set is len(allowable_set).
- Returns
A one-hot vector of atom types. If include_unknown_set is False, the length is len(allowable_set). If include_unknown_set is True, the length is len(allowable_set) + 1.
- Return type
List[float]
- construct_hydrogen_bonding_info(mol: Any) List[Tuple[int, str]] [source]¶
Construct hydrogen bonding information for a molecule.
- Parameters
mol (rdkit.Chem.rdchem.Mol) – RDKit mol object
- Returns
A list of tuples (atom_index, hydrogen_bonding_type). The hydrogen_bonding_type value is “Acceptor” or “Donor”.
- Return type
List[Tuple[int, str]]
- get_atom_hydrogen_bonding_one_hot(atom: Any, hydrogen_bonding: List[Tuple[int, str]]) List[float] [source]¶
Get a one-hot feature indicating whether an atom is a hydrogen bond donor or acceptor.
- Parameters
atom (rdkit.Chem.rdchem.Atom) – RDKit atom object
hydrogen_bonding (List[Tuple[int, str]]) – The return value of construct_hydrogen_bonding_info. The value is a list of tuples (atom_index, hydrogen_bonding_type) like (1, “Acceptor”).
- Returns
A one-hot vector of the hydrogen bonding type. The first element indicates “Donor”, and the second element indicates “Acceptor”.
- Return type
List[float]
- get_atom_is_in_aromatic_one_hot(atom: Any) List[float] [source]¶
Get a one-hot feature about whether an atom is in an aromatic system or not.
- Parameters
atom (rdkit.Chem.rdchem.Atom) – RDKit atom object
- Returns
A vector of whether an atom is in an aromatic system or not.
- Return type
List[float]
- get_atom_hybridization_one_hot(atom: Any, allowable_set: List[str] = ['SP', 'SP2', 'SP3'], include_unknown_set: bool = False) List[float] [source]¶
Get a one-hot feature of hybridization type.
- Parameters
atom (rdkit.Chem.rdchem.Atom) – RDKit atom object
allowable_set (List[str]) – The hybridization types to consider. The default set is [“SP”, “SP2”, “SP3”]
include_unknown_set (bool, default False) – If true, the index of all types not in allowable_set is len(allowable_set).
- Returns
A one-hot vector of the hybridization type. If include_unknown_set is False, the length is len(allowable_set). If include_unknown_set is True, the length is len(allowable_set) + 1.
- Return type
List[float]
- get_atom_total_num_Hs_one_hot(atom: Any, allowable_set: List[int] = [0, 1, 2, 3, 4], include_unknown_set: bool = True) List[float] [source]¶
Get a one-hot feature of the number of hydrogens which an atom has.
- Parameters
atom (rdkit.Chem.rdchem.Atom) – RDKit atom object
allowable_set (List[int]) – The number of hydrogens to consider. The default set is [0, 1, …, 4]
include_unknown_set (bool, default True) – If true, the index of all types not in allowable_set is len(allowable_set).
- Returns
A one-hot vector of the number of hydrogens which an atom has. If include_unknown_set is False, the length is len(allowable_set). If include_unknown_set is True, the length is len(allowable_set) + 1.
- Return type
List[float]
- get_atom_chirality_one_hot(atom: Any) List[float] [source]¶
Get a one-hot feature of an atom's chirality type.
- Parameters
atom (rdkit.Chem.rdchem.Atom) – RDKit atom object
- Returns
A one-hot vector of the chirality type. The first element indicates “R”, and the second element indicates “S”.
- Return type
List[float]
- get_atom_formal_charge(atom: Any) List[float] [source]¶
Get the formal charge of an atom.
- Parameters
atom (rdkit.Chem.rdchem.Atom) – RDKit atom object
- Returns
A vector of the formal charge.
- Return type
List[float]
- get_atom_partial_charge(atom: Any) List[float] [source]¶
Get the partial charge of an atom.
- Parameters
atom (rdkit.Chem.rdchem.Atom) – RDKit atom object
- Returns
A vector of the partial charge.
- Return type
List[float]
Notes
Before using this function, you must compute Gasteiger charges, e.g. with AllChem.ComputeGasteigerCharges(mol).
- get_atom_total_degree_one_hot(atom: Any, allowable_set: List[int] = [0, 1, 2, 3, 4, 5], include_unknown_set: bool = True) List[float] [source]¶
Get a one-hot feature of the degree which an atom has.
- Parameters
atom (rdkit.Chem.rdchem.Atom) – RDKit atom object
allowable_set (List[int]) – The degree to consider. The default set is [0, 1, …, 5]
include_unknown_set (bool, default True) – If true, the index of all types not in allowable_set is len(allowable_set).
- Returns
A one-hot vector of the degree which an atom has. If include_unknown_set is False, the length is len(allowable_set). If include_unknown_set is True, the length is len(allowable_set) + 1.
- Return type
List[float]
- get_bond_type_one_hot(bond: Any, allowable_set: List[str] = ['SINGLE', 'DOUBLE', 'TRIPLE', 'AROMATIC'], include_unknown_set: bool = False) List[float] [source]¶
Get a one-hot feature of bond type.
- Parameters
bond (rdkit.Chem.rdchem.Bond) – RDKit bond object
allowable_set (List[str]) – The bond types to consider. The default set is [“SINGLE”, “DOUBLE”, “TRIPLE”, “AROMATIC”].
include_unknown_set (bool, default False) – If true, the index of all types not in allowable_set is len(allowable_set).
- Returns
A one-hot vector of the bond type. If include_unknown_set is False, the length is len(allowable_set). If include_unknown_set is True, the length is len(allowable_set) + 1.
- Return type
List[float]
- get_bond_is_in_same_ring_one_hot(bond: Any) List[float] [source]¶
Get a one-hot feature about whether the atoms of a bond are in the same ring or not.
- Parameters
bond (rdkit.Chem.rdchem.Bond) – RDKit bond object
- Returns
A one-hot vector of whether a bond is in the same ring or not.
- Return type
List[float]
- get_bond_is_conjugated_one_hot(bond: Any) List[float] [source]¶
Get a one-hot feature about whether a bond is conjugated or not.
- Parameters
bond (rdkit.Chem.rdchem.Bond) – RDKit bond object
- Returns
A one-hot vector of whether a bond is conjugated or not.
- Return type
List[float]
- get_bond_stereo_one_hot(bond: Any, allowable_set: List[str] = ['STEREONONE', 'STEREOANY', 'STEREOZ', 'STEREOE'], include_unknown_set: bool = True) List[float] [source]¶
Get a one-hot feature of the stereo configuration of a bond.
- Parameters
bond (rdkit.Chem.rdchem.Bond) – RDKit bond object
allowable_set (List[str]) – The stereo configuration types to consider. The default set is [“STEREONONE”, “STEREOANY”, “STEREOZ”, “STEREOE”].
include_unknown_set (bool, default True) – If true, the index of all types not in allowable_set is len(allowable_set).
- Returns
A one-hot vector of the stereo configuration of a bond. If include_unknown_set is False, the length is len(allowable_set). If include_unknown_set is True, the length is len(allowable_set) + 1.
- Return type
List[float]
- get_bond_graph_distance_one_hot(bond: Any, graph_dist_matrix: numpy.ndarray, allowable_set: List[int] = [1, 2, 3, 4, 5, 6, 7], include_unknown_set: bool = True) List[float] [source]¶
Get a one-hot feature of graph distance.
- Parameters
bond (rdkit.Chem.rdchem.Bond) – RDKit bond object
graph_dist_matrix (np.ndarray) – The return value of Chem.GetDistanceMatrix(mol). The shape is (num_atoms, num_atoms).
allowable_set (List[int]) – The graph distance types to consider. The default set is [1, 2, …, 7].
include_unknown_set (bool, default True) – If true, the index of all types not in allowable_set is len(allowable_set).
- Returns
A one-hot vector of the graph distance. If include_unknown_set is False, the length is len(allowable_set). If include_unknown_set is True, the length is len(allowable_set) + 1.
- Return type
List[float]
Grover Utilities¶
- extract_grover_attributes(molgraph: deepchem.feat.graph_data.BatchGraphData)[source]¶
Utility to extract grover attributes for grover model
- Parameters
molgraph (BatchGraphData) – A batched graph data representing a collection of molecules.
- Returns
graph_attributes – A tuple containing atom features, bond features, atom to bond mapping, bond to atom mapping, bond to reverse bond mapping, atom to atom mapping, atom scope, bond scope, functional group labels and other additional features.
- Return type
Tuple
Example
>>> import deepchem as dc
>>> from deepchem.feat.graph_data import BatchGraphData
>>> smiles = ['CC', 'CCC', 'CC(=O)C']
>>> featurizer = dc.feat.GroverFeaturizer(features_generator=dc.feat.CircularFingerprint())
>>> graphs = featurizer.featurize(smiles)
>>> molgraph = BatchGraphData(graphs)
>>> attributes = extract_grover_attributes(molgraph)
Debug Utilities¶
Docking Utilities¶
These utilities assist in file preparation and processing for molecular docking.
- write_vina_conf(protein_filename: str, ligand_filename: str, centroid: numpy.ndarray, box_dims: numpy.ndarray, conf_filename: str, num_modes: int = 9, exhaustiveness: Optional[int] = None) None [source]¶
Writes Vina configuration file to disk.
Autodock Vina accepts a configuration file which provides options under which Vina is invoked. This utility function writes a Vina configuration file which directs Autodock Vina to perform docking under the provided options.
- Parameters
protein_filename (str) – Filename for protein
ligand_filename (str) – Filename for the ligand
centroid (np.ndarray) – A numpy array with shape (3,) holding centroid of system
box_dims (np.ndarray) – A numpy array of shape (3,) holding the size of the box to dock
conf_filename (str) – Filename to write Autodock Vina configuration to.
num_modes (int, optional (default 9)) – The number of binding modes Autodock Vina should find
exhaustiveness (int, optional) – The exhaustiveness of the search to be performed by Vina
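To show what such a configuration file might contain, here is a minimal sketch that writes the standard Autodock Vina options (receptor, ligand, center_x/y/z, size_x/y/z, num_modes, exhaustiveness). The helper name write_vina_conf_sketch is hypothetical, and the exact set and order of lines DeepChem writes is an assumption.

```python
import os
import tempfile


def write_vina_conf_sketch(protein_filename, ligand_filename, centroid,
                           box_dims, conf_filename, num_modes=9,
                           exhaustiveness=None):
    """Hypothetical helper: write a key = value Vina configuration file."""
    lines = [
        f"receptor = {protein_filename}",
        f"ligand = {ligand_filename}",
    ]
    # Box center and edge lengths along each axis.
    for axis, center, size in zip("xyz", centroid, box_dims):
        lines.append(f"center_{axis} = {center}")
        lines.append(f"size_{axis} = {size}")
    lines.append(f"num_modes = {num_modes}")
    if exhaustiveness is not None:
        lines.append(f"exhaustiveness = {exhaustiveness}")
    with open(conf_filename, "w") as f:
        f.write("\n".join(lines) + "\n")


conf_path = os.path.join(tempfile.mkdtemp(), "conf.txt")
write_vina_conf_sketch("protein.pdbqt", "ligand.pdbqt", (1.0, 2.0, 3.0),
                       (20.0, 20.0, 20.0), conf_path, exhaustiveness=8)
with open(conf_path) as f:
    conf_text = f.read()
```

The file names protein.pdbqt and ligand.pdbqt are placeholders; the resulting file could then be passed to Vina through its --config option.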
- write_gnina_conf(protein_filename: str, ligand_filename: str, conf_filename: str, num_modes: int = 9, exhaustiveness: Optional[int] = None, **kwargs) None [source]¶
Writes GNINA configuration file to disk.
GNINA accepts a configuration file which provides options under which GNINA is invoked. This utility function writes a configuration file which directs GNINA to perform docking under the provided options.
- Parameters
protein_filename (str) – Filename for protein
ligand_filename (str) – Filename for the ligand
conf_filename (str) – Filename to write GNINA configuration to.
num_modes (int, optional (default 9)) – The number of binding modes GNINA should find
exhaustiveness (int, optional) – The exhaustiveness of the search to be performed by GNINA
kwargs – Args supported by GNINA documented here https://github.com/gnina/gnina#usage
- load_docked_ligands(pdbqt_output: str) Tuple[List[Any], List[float]] [source]¶
This function loads ligands docked by Autodock Vina.
Autodock Vina writes outputs to disk in a PDBQT file format. This PDBQT file can contain multiple docked “poses”. Recall that a pose is an energetically favorable 3D conformation of a molecule. This utility function reads and loads the structures for multiple poses from Vina’s output file.
- Parameters
pdbqt_output (str) – Should be the filename of a file generated by Autodock Vina’s docking software.
- Returns
Tuple of molecules, scores. molecules is a list of rdkit molecules with 3D information. scores is a list of the associated Vina scores.
- Return type
Tuple[List[rdkit.Chem.rdchem.Mol], List[float]]
Notes
This function requires RDKit to be installed.
- prepare_inputs(protein: str, ligand: str, replace_nonstandard_residues: bool = True, remove_heterogens: bool = True, remove_water: bool = True, add_hydrogens: bool = True, pH: float = 7.0, optimize_ligand: bool = True, pdb_name: Optional[str] = None) Tuple[Any, Any] [source]¶
This prepares protein-ligand complexes for docking.
Autodock Vina requires PDB files for proteins and ligands with sensible inputs. This function uses PDBFixer and RDKit to ensure that inputs are reasonable and ready for docking. Default values are given for convenience, but fixing PDB files is complicated and human judgement is required to produce protein structures suitable for docking. Always inspect the results carefully before trying to perform docking.
- Parameters
protein (str) – Filename for protein PDB file or a PDBID.
ligand (str) – Either a filename for a ligand PDB file or a SMILES string.
replace_nonstandard_residues (bool (default True)) – Replace nonstandard residues with standard residues.
remove_heterogens (bool (default True)) – Removes residues that are not standard amino acids or nucleotides.
remove_water (bool (default True)) – Remove water molecules.
add_hydrogens (bool (default True)) – Add missing hydrogens at the protonation state given by pH.
pH (float (default 7.0)) – Most common form of each residue at given pH value is used.
optimize_ligand (bool (default True)) – If True, optimize ligand with RDKit. Required for SMILES inputs.
pdb_name (Optional[str]) – If given, write sanitized protein and ligand to files called “pdb_name.pdb” and “ligand_pdb_name.pdb”
- Returns
Tuple of protein_molecule, ligand_molecule with 3D information.
- Return type
Tuple[RDKitMol, RDKitMol]
Note
This function requires RDKit and OpenMM to be installed. Read more about PDBFixer here: https://github.com/openmm/pdbfixer.
Examples
>>> p, m = prepare_inputs('3cyx', 'CCC')
>>> p.GetNumAtoms()
>>> m.GetNumAtoms()
>>> p, m = prepare_inputs('3cyx', 'CCC', remove_heterogens=False)
>>> p.GetNumAtoms()
- read_gnina_log(log_file: str) numpy.ndarray [source]¶
Read GNINA logfile and get docking scores.
GNINA writes computed binding affinities to a logfile.
- Parameters
log_file (str) – Filename of logfile generated by GNINA.
- Returns
scores – Array of binding affinity (kcal/mol), CNN pose score, and CNN affinity for each binding mode.
- Return type
np.array, dimension (num_modes, 3)
Print Threshold¶
The printing threshold controls how many dataset elements are printed when dc.data.Dataset objects are converted to strings or represented in the IPython REPL.
- get_print_threshold() int [source]¶
Return the printing threshold for datasets.
The print threshold is the number of elements from ids/tasks to print when printing representations of Dataset objects.
- Returns
threshold – Number of elements that will be printed
- Return type
int
- set_print_threshold(threshold: int)[source]¶
Set print threshold
The print threshold is the number of elements from ids/tasks to print when printing representations of Dataset objects.
- Parameters
threshold (int) – Number of elements to print.
- get_max_print_size() int [source]¶
Return the max print size for a dataset.
If a dataset is large, printing self.ids as part of a string representation can be very slow. This field controls the maximum size for a dataset before ids are no longer printed.
- Returns
max_print_size – Maximum length of a dataset for ids to be printed in string representation.
- Return type
int
- set_max_print_size(max_print_size: int)[source]¶
Set max_print_size
If a dataset is large, printing self.ids as part of a string representation can be very slow. This field controls the maximum size for a dataset before ids are no longer printed.
- Parameters
max_print_size (int) – Maximum length of a dataset for ids to be printed in string representation.
Fake Data Generator¶
The utilities here are used to generate random sample data which can be used for testing model architectures or other purposes.
- class FakeGraphGenerator(min_nodes: int = 10, max_nodes: int = 10, n_node_features: int = 5, avg_degree: int = 4, n_edge_features: int = 3, n_classes: int = 2, task: str = 'graph', **kwargs)[source]¶
Generates random graphs which can be used for testing or other purposes.
The generated graph supports both node-level and graph-level labels.
Example
>>> from deepchem.utils.fake_data_generator import FakeGraphGenerator
>>> fgg = FakeGraphGenerator(min_nodes=8, max_nodes=10, n_node_features=5, avg_degree=8, n_edge_features=3, n_classes=2, task='graph', z=5)
>>> graphs = fgg.sample(n_graphs=10)
>>> type(graphs)
<class 'deepchem.data.datasets.NumpyDataset'>
>>> type(graphs.X[0])
<class 'deepchem.feat.graph_data.GraphData'>
>>> len(graphs) == 10 # num_graphs
True
Note
The FakeGraphGenerator class is based on the torch_geometric.datasets.FakeDataset class.
- __init__(min_nodes: int = 10, max_nodes: int = 10, n_node_features: int = 5, avg_degree: int = 4, n_edge_features: int = 3, n_classes: int = 2, task: str = 'graph', **kwargs)[source]¶
- Parameters
min_nodes (int, default 10) – Minimum number of permissible nodes in a graph
max_nodes (int, default 10) – Maximum number of permissible nodes in a graph
n_node_features (int, default 5) – Average number of node features in a graph
avg_degree (int, default 4) – Average degree of the graph (avg_degree should be a positive number greater than the min_nodes)
n_edge_features (int, default 3) – Average number of features in the edge
task (str, default 'graph') – Indicates node-level labels or graph-level labels
kwargs (optional) – Additional graph attributes and their shapes, e.g. global_features = 5
- sample(n_graphs: int = 100) deepchem.data.datasets.NumpyDataset [source]¶
Samples graphs
- Parameters
n_graphs (int, default 100) – Number of graphs to generate
- Returns
graphs – Generated Graphs
- Return type
NumpyDataset
Electron Sampler¶
The utilities here are used to sample electrons in a given molecule and update them using Monte Carlo methods, which can be used for methods like Variational Monte Carlo.
- class ElectronSampler(central_value: numpy.ndarray, f: Callable[[numpy.ndarray], numpy.ndarray], batch_no: int = 10, x: numpy.ndarray = array([], dtype=float64), steps: int = 200, steps_per_update: int = 10, seed: Optional[int] = None, symmetric: bool = True, simultaneous: bool = True)[source]¶
This class initializes electron positions using a Gaussian distribution around each nucleus and updates them using Markov Chain Monte Carlo (MCMC) moves.
Using the probability obtained from the squared magnitude of the wavefunction of a molecule/atom, MCMC steps can be performed to get the electrons’ positions and further update the wavefunction. This method is primarily used in methods like Variational Monte Carlo to sample electrons around the nuclei. Sampling can be done in 2 ways:
- Simultaneous: The positions of all electrons are updated at once.
- Single-electron: MCMC steps are performed on only one particular electron, given its index value.
Further, these moves can be done in 2 ways:
- Symmetric: In this configuration, the standard deviation is uniform for all the steps.
- Asymmetric: In this configuration, the standard deviation is not uniform and is typically obtained from a function such as harmonic distances.
Irrespective of these methods, the initialization is done uniformly around the respective nucleus and the number of electrons specified.
Example
>>> import numpy as np
>>> from deepchem.utils.electron_sampler import ElectronSampler
>>> test_f = lambda x: 2*np.log(np.random.uniform(low=0,high=1.0,size=np.shape(x)[0]))
>>> distribution=ElectronSampler(central_value=np.array([[1,1,3],[3,2,3]]),f=test_f,seed=0,batch_no=2,steps=1000,)
>>> distribution.gauss_initialize_position(np.array([[1],[2]]))
>>> print(distribution.x)
[[[[1.03528105 1.00800314 3.01957476]]
  [[3.01900177 1.99697286 2.99793562]]
  [[3.00821197 2.00288087 3.02908547]]]
 [[[1.04481786 1.03735116 2.98045444]]
  [[3.01522075 2.0024335 3.00887726]]
  [[3.00667349 2.02988158 2.99589683]]]]
>>> distribution.move()
0.5115
>>> print(distribution.x)
[[[[-0.32441754 1.23330263 2.67927645]]
  [[ 3.42250997 2.23617126 3.55806632]]
  [[ 3.37491385 1.54374006 3.13575241]]]
 [[[ 0.49067726 1.03987841 3.70277884]]
  [[ 3.5631939 1.68703947 2.5685874 ]]
  [[ 2.84560249 1.73998364 3.41274181]]]]
- __init__(central_value: numpy.ndarray, f: Callable[[numpy.ndarray], numpy.ndarray], batch_no: int = 10, x: numpy.ndarray = array([], dtype=float64), steps: int = 200, steps_per_update: int = 10, seed: Optional[int] = None, symmetric: bool = True, simultaneous: bool = True)[source]¶
- Parameters
central_value (np.ndarray) – Contains each nucleus’ coordinates in a 2D array. The shape of the array should be (number_of_nucleus, 3), e.g. [[1,2,3],[3,4,5],..]
f (Callable[[np.ndarray], np.ndarray]) – A function that should return twice the log probability of the wavefunction of the molecular system when called. It should take a 4D array of electron positions (x) as its argument and return a numpy array containing the log probabilities of each batch.
batch_no (int, optional (default 10)) – Number of batches of the electron’s positions to be initialized.
x (np.ndarray, optional (default np.ndarray([]))) – Contains the electrons’ coordinates in a 4D array. The shape of the array should be (batch_no, no_of_electrons, 1, 3). Can be a 1D empty array when the electrons’ positions are yet to be initialized.
steps (int, optional (default 200)) – The number of MCMC steps to be performed when the moves are called.
steps_per_update (int (default 10)) – The number of steps after which the parameters of the MCMC get updated.
seed (int, optional (default None)) – Random seed to use.
symmetric (bool, optional (default True)) – If True, symmetric moves will be used; otherwise asymmetric moves will be used.
simultaneous (bool, optional (default True)) – If True, MCMC steps will be performed on all the electrons; otherwise only a single electron gets updated.
- harmonic_mean(y: numpy.ndarray) numpy.ndarray [source]¶
Calculates the harmonic mean of the value ‘y’ from self.central_value. The numpy array returned is typically scaled up to get the standard deviation matrix.
- Parameters
y (np.ndarray) – Containing the data distribution. Shape of y should be (batch,no_of_electron,1,3)
- Returns
Contains the harmonic mean of the data distribution of each batch. Shape of the array obtained (batch_no, no_of_electrons,1,1)
- Return type
np.ndarray
- log_prob_gaussian(y: numpy.ndarray, mu: numpy.ndarray, sigma: numpy.ndarray) numpy.ndarray [source]¶
Calculates the log probability of a Gaussian distribution, given the mean and standard deviation.
- Parameters
y (np.ndarray) – Data for which the Gaussian log probability is to be found.
mu (np.ndarray) – Means with respect to which the log probability is calculated. Same shape as y or broadcastable to y.
sigma (np.ndarray) – The standard deviation of the Gaussian distribution. Same shape as y or broadcastable to y.
- Returns
Log probability of gaussian distribution, with the shape - (batch_no,).
- Return type
np.ndarray
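As a rough illustration of the quantity described above, the following pure-Python sketch evaluates the standard Gaussian log density and sums it per batch, so that one value per batch is returned. The name `log_prob_gaussian_sketch`, the use of scalar `mu` and `sigma`, and the inclusion of the normalization constant are assumptions made for this example; the source does not state how the library handles these details.

```python
import math


def log_prob_gaussian_sketch(y, mu, sigma):
    """Return one Gaussian log probability per batch entry.

    y: list of batches, each a flat list of coordinates.
    mu, sigma: scalars here for simplicity (the documented function
    accepts arrays that broadcast against y).
    """
    out = []
    for batch in y:
        total = 0.0
        for value in batch:
            z = (value - mu) / sigma
            # log N(value; mu, sigma) = -z^2/2 - log(sigma) - log(2*pi)/2
            total += -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
        out.append(total)
    return out


batches = [[0.0, 0.0, 0.0], [1.0, -1.0, 2.0]]
log_probs = log_prob_gaussian_sketch(batches, mu=0.0, sigma=1.0)
print(len(log_probs))  # 2, one value per batch
```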
- gauss_initialize_position(no_sample: numpy.ndarray, stddev: float = 0.02)[source]¶
Initializes the position around a central value as mean sampled from a gauss distribution and updates self.x.
- Parameters
no_sample (np.ndarray) – Contains the number of samples to initialize under each mean. It should be in the form [[3],[2]..], which here means that 3 samples and 2 samples are taken around the first and second entries of self.central_value, respectively.
stddev (float, optional (default 0.02)) – Contains the stddev with which the electrons’ coordinates are initialized.
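The initialization can be pictured with a small standard-library sketch. Everything here is hypothetical: the function name, the nested-list layout, and the explicit `batch_no` argument are assumptions chosen to reproduce the (batch_no, no_of_electrons, 1, 3) shape described in this section, not the class's actual code.

```python
import random


def gauss_initialize_sketch(central_value, no_sample, batch_no, stddev=0.02, seed=None):
    """Hypothetical sketch: place electrons around nuclei with Gaussian noise.

    Returns a nested list shaped (batch_no, no_of_electrons, 1, 3).
    """
    rng = random.Random(seed)
    x = []
    for _ in range(batch_no):
        electrons = []
        for nucleus, (count,) in zip(central_value, no_sample):
            for _ in range(count):
                position = [rng.gauss(c, stddev) for c in nucleus]
                electrons.append([position])  # the extra list gives the axis of size 1
        x.append(electrons)
    return x


x = gauss_initialize_sketch([[1, 1, 3], [3, 2, 3]], [[1], [2]], batch_no=2, seed=0)
print(len(x), len(x[0]), len(x[0][0]), len(x[0][0][0]))  # 2 3 1 3
```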
- electron_update(lp1, lp2, move_prob, ratio, x2) numpy.ndarray [source]¶
Performs sampling & parameter updates of electrons and appends the sampled electrons to self.sampled_electrons.
- Parameters
lp1 (np.ndarray) – Log probability of initial parameter state.
lp2 (np.ndarray) – Log probability of the new sampled state.
move_prob (np.ndarray) – Sampled log probability of the electron moving from the initial to the final state, sampled asymmetrically or symmetrically.
ratio (np.ndarray) – Ratio of lp1 and lp2 state.
x2 (np.ndarray) – Numpy array of the new sampled electrons.
- Returns
lp1 – The updated log probability of the initial parameter state.
- Return type
np.ndarray
- move(stddev: float = 0.02, asymmetric_func: Optional[Callable[[numpy.ndarray], numpy.ndarray]] = None, index: Optional[int] = None) float [source]¶
Performs a Metropolis-Hastings move for self.x (electrons). The type of move to be followed (simultaneous or single-electron, symmetric or asymmetric) is specified when instantiating the class. The self.x array is replaced with a new array at the end of each step containing the new electrons’ positions.
- Parameters
asymmetric_func (Callable[[np.ndarray], np.ndarray], optional (default None)) – Should be specified for an asymmetric move. The function should take only 1 argument, y: a numpy array with respect to which the mean should be calculated. This function should return the mean for the asymmetric proposal. For FermiNet, this function is the harmonic mean of the distance between the electron and the nucleus.
stddev (float, optional (default 0.02)) – Specifies the standard deviation in the case of symmetric moves and the scaling factor of the standard deviation matrix in the case of asymmetric moves.
index (int, optional (default None)) – Specifies the index of the electron to be updated in the case of a single electron move.
- Returns
Accepted move ratio of the MCMC steps.
- Return type
float
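To make the idea of a symmetric, simultaneous Metropolis-Hastings move concrete, here is a minimal standard-library sketch. It is not the ElectronSampler implementation: the function name, the flat list of coordinates, and the toy target density are all assumptions chosen for illustration. It returns the fraction of accepted proposals, analogous to the accepted move ratio described above.

```python
import math
import random


def metropolis_hastings_sketch(x, log_prob, steps=200, stddev=0.02, seed=None):
    """Symmetric, simultaneous Metropolis-Hastings moves on a flat list of coordinates.

    log_prob: callable returning the log probability of a configuration.
    Returns the final configuration and the fraction of accepted moves.
    """
    rng = random.Random(seed)
    current = list(x)
    lp_current = log_prob(current)
    accepted = 0
    for _ in range(steps):
        # Symmetric Gaussian proposal: every coordinate is perturbed at once.
        proposal = [c + rng.gauss(0.0, stddev) for c in current]
        lp_proposal = log_prob(proposal)
        # With a symmetric proposal the acceptance ratio is p(new) / p(old).
        log_ratio = lp_proposal - lp_current
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current, lp_current = proposal, lp_proposal
            accepted += 1
    return current, accepted / steps


# Toy target: a standard normal in three dimensions (an illustrative choice).
target = lambda pos: -0.5 * sum(c * c for c in pos)
final, ratio = metropolis_hastings_sketch([1.0, 1.0, 3.0], target, steps=1000, stddev=0.5, seed=0)
print(len(final), 0.0 <= ratio <= 1.0)  # 3 True
```

For an asymmetric proposal, the acceptance test would additionally include the ratio of forward and reverse proposal probabilities, which this sketch omits.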
Density Functional Theory Utilities¶
The utilities here are used to create an object that contains information about a system’s self-consistent iteration steps and other processes.
- class KSCalc(qc: deepchem.utils.dftutils.BaseQCCalc)[source]¶
Interface to DQC’s KS calculation.
- Parameters
qc (BaseQCCalc) – An object that often acts as a wrapper around an engine class (from dqc.qccalc) that contains information about the self-consistent iterations.
References
Kasim, Muhammad F., and Sam M. Vinko. “Learning the exchange-correlation functional from nature with fully differentiable density functional theory.” Physical Review Letters 127.12 (2021): 126403. https://github.com/diffqc/dqc/blob/master/dqc/qccalc/ks.py
- __init__(qc: deepchem.utils.dftutils.BaseQCCalc)[source]¶
- energy() torch.Tensor [source]¶
- Returns
The total energy of the Kohn-Sham calculation for a particular system.
- Return type
torch.Tensor
- aodmtot() torch.Tensor [source]¶
Both interacting and non-interacting system’s total energy can be expressed in terms of the density matrix. The ground state properties of a system can be calculated by minimizing the energy w.r.t the density matrix.
- Returns
The total density matrix in atomic orbital bases.
- Return type
torch.Tensor
- dens(rgrid: torch.Tensor) torch.Tensor [source]¶
The ground state density n(r) of a system.
- Parameters
rgrid (torch.Tensor) – The integration grid, calculated using dqc.grid.
- Returns
The total density profile in the given grid
References
https://github.com/diffqc/dqc/blob/master/dqc/grid/base_grid.py
- hashstr(s: str) str [source]¶
Encodes the string into hashed format - hexadecimal digits.
- Parameters
s (str) –
- class BaseGrid[source]¶
Interface to DQC’s BaseGrid class. BaseGrid is a class that regulates the integration points over the spatial dimensions.
- Parameters
qc (BaseQCCalc) – An object that often acts as a wrapper around an engine class (from dqc.qccalc) that contains information about the self-consistent iterations.
References
Kasim, Muhammad F., and Sam M. Vinko. “Learning the exchange-correlation functional from nature with fully differentiable density functional theory.” Physical Review Letters 127.12 (2021): 126403. https://github.com/diffqc/dqc/blob/0fe821fc92cb3457fb14f6dff0c223641c514ddb/dqc/grid/base_grid.py
- abstract get_dvolume() torch.Tensor [source]¶
Obtain the torch.tensor containing the dV elements for the integration.
- Returns
The dV elements for the integration. *BG is the length of the BaseGrid.
- Return type
torch.tensor (*BG, ngrid)
- class BaseQCCalc[source]¶
Quantum Chemistry calculation. This class is the interface to the users regarding parameters that can be calculated after the self-consistent iterations (or other processes).
References
Kasim, Muhammad F., and Sam M. Vinko. “Learning the exchange-correlation functional from nature with fully differentiable density functional theory.” Physical Review Letters 127.12 (2021): 126403. https://github.com/diffqc/dqc/blob/master/dqc/utils/datastruct.py
- abstract run(**kwargs)[source]¶
Run the calculation. Note that this method can be invoked several times for one object to try for various self-consistent options to reach convergence.
- abstract aodm() Union[torch.Tensor, deepchem.utils.dftutils.SpinParam[torch.Tensor]] [source]¶
Returns the density matrix in the atomic orbital basis. For the polarized case, it returns a SpinParam of 2 tensors representing the density matrices for spin-up and spin-down.
- abstract dm2energy(dm: Union[torch.Tensor, deepchem.utils.dftutils.SpinParam[torch.Tensor]]) torch.Tensor [source]¶
Calculate the energy from the given density matrix.
- Parameters
dm (torch.Tensor or SpinParam of torch.Tensor) – The input density matrix. It is tensor if restricted, and SpinParam of tensor if unrestricted.
- Returns
Tensor that represents the energy given the density matrix.
- Return type
torch.Tensor
- class SpinParam(u: deepchem.utils.dftutils.T, d: deepchem.utils.dftutils.T)[source]¶
Data structure to store different values for spin-up and spin-down electrons.
References
Kasim, Muhammad F., and Sam M. Vinko. “Learning the exchange-correlation functional from nature with fully differentiable density functional theory.” Physical Review Letters 127.12 (2021): 126403. https://github.com/diffqc/dqc/blob/master/dqc/utils/datastruct.py
- class _Config(THRESHOLD_MEMORY: int = 10737418240, CHUNK_MEMORY: int = 16777216, VERBOSE: int = 0)[source]¶
Contains the configuration for the DFT module
Examples
>>> from deepchem.utils.dft_utils.config import config
>>> Memory_usage = 1024**4 # Sample Memory usage by some Object/Matrix
>>> if Memory_usage > config.THRESHOLD_MEMORY :
...     print("Overload")
Overload
- THRESHOLD_MEMORY[source]¶
Threshold memory (matrix above this size should not be constructed)
- Type
int (default=10*1024**3)
- CHUNK_MEMORY[source]¶
The memory for splitting big tensors into chunks.
- Type
int (default=16*1024**2)
- VERBOSE[source]¶
Allowed verbosity level (defines the level of detail). Used by the Logger for maintaining logs.
- Type
int (default=0)
Usage
1. HamiltonCGTO: Uses it for splitting big tensors into chunks.
Pytorch Utilities¶
- unsorted_segment_sum(data: torch.Tensor, segment_ids: torch.Tensor, num_segments: int) torch.Tensor [source]¶
Computes the sum along segments of a tensor. Analogous to tf.unsorted_segment_sum.
- Parameters
data (torch.Tensor) – A tensor whose segments are to be summed.
segment_ids (torch.Tensor) – The segment indices tensor.
num_segments (int) – The number of segments.
- Returns
tensor
- Return type
torch.Tensor
Examples
>>> segment_ids = torch.Tensor([0, 1, 0]).to(torch.int64)
>>> data = torch.Tensor([[1, 2, 3, 4], [5, 6, 7, 8], [4, 3, 2, 1]])
>>> num_segments = 2
>>> result = unsorted_segment_sum(data=data,
...                               segment_ids=segment_ids,
...                               num_segments=num_segments)
>>> data.shape[0]
3
>>> segment_ids.shape[0]
3
>>> len(segment_ids.shape)
1
>>> result
tensor([[5., 5., 5., 5.],
        [5., 6., 7., 8.]])
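The semantics of a segment sum can be spelled out without any tensor library. The sketch below is a pure-Python illustration operating on nested lists; the name `unsorted_segment_sum_sketch` is hypothetical and the code is not how the PyTorch version is implemented. It reuses the data from the example above and arrives at the same values.

```python
def unsorted_segment_sum_sketch(data, segment_ids, num_segments):
    """Sum the rows of `data` that share a segment id (pure-Python illustration)."""
    width = len(data[0])
    result = [[0.0] * width for _ in range(num_segments)]
    for row, segment in zip(data, segment_ids):
        for column, value in enumerate(row):
            result[segment][column] += value
    return result


data = [[1, 2, 3, 4], [5, 6, 7, 8], [4, 3, 2, 1]]
print(unsorted_segment_sum_sketch(data, [0, 1, 0], 2))
# [[5.0, 5.0, 5.0, 5.0], [5.0, 6.0, 7.0, 8.0]]
```

Rows 0 and 2 share segment id 0 and are added together, while row 1 is the only member of segment 1.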
- segment_sum(data: torch.Tensor, segment_ids: torch.Tensor) torch.Tensor [source]¶
This function computes the sum of values along segments within a tensor. It is useful when you have a tensor with segment IDs and you want to compute the sum of values for each segment. This function is analogous to tf.segment_sum. (https://www.tensorflow.org/api_docs/python/tf/math/segment_sum).
- Parameters
data (torch.Tensor) – A pytorch tensor containing the values to be summed. It can have any shape, but its rank (number of dimensions) should be at least 1.
segment_ids (torch.Tensor) – A 1-D tensor containing the indices for the segmentation. The segments can be any non-negative integer values, but they must be sorted in non-decreasing order.
- Returns
out_tensor – Tensor whose first dimension equals the number of segments and whose remaining dimensions match data, where each entry is the sum of the values within the corresponding segment.
- Return type
torch.Tensor
Examples
>>> data = torch.Tensor([[1, 2, 3, 4], [4, 3, 2, 1], [5, 6, 7, 8]])
>>> segment_ids = torch.Tensor([0, 0, 1]).to(torch.int64)
>>> result = segment_sum(data=data, segment_ids=segment_ids)
>>> data.shape[0]
3
>>> segment_ids.shape[0]
3
>>> len(segment_ids.shape)
1
>>> result
tensor([[5., 5., 5., 5.],
        [5., 6., 7., 8.]])
- chunkify(a: torch.Tensor, dim: int, maxnumel: int) Generator[Tuple[torch.Tensor, int, int], None, None] [source]¶
Splits the tensor a into several chunks of size maxnumel along the dimension given by dim.
Examples
>>> import torch
>>> from deepchem.utils.pytorch_utils import chunkify
>>> a = torch.arange(10)
>>> for chunk, istart, iend in chunkify(a, 0, 3):
...     print(chunk, istart, iend)
tensor([0, 1, 2]) 0 3
tensor([3, 4, 5]) 3 6
tensor([6, 7, 8]) 6 9
tensor([9]) 9 12
- Parameters
a (torch.Tensor) – The big tensor to be split into chunks.
dim (int) – The dimension along which the tensor is split.
maxnumel (int) – Maximum number of elements in a chunk.
- Returns
chunks – A generator that yields a tuple of three elements: the chunk tensor, the starting index of the chunk and the ending index of the chunk.
- Return type
Generator[Tuple[torch.Tensor, int, int], None, None]
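The generator pattern behind this function can be sketched for a plain one-dimensional list. The helper `chunkify_list` is a hypothetical stand-in written for illustration, not the library code. It deliberately leaves the end index unclamped so that it reproduces the index triples shown in the example for this function, including the final `9 12`.

```python
def chunkify_list(values, max_len):
    """Yield (chunk, start, end) triples covering `values` in steps of `max_len`."""
    for start in range(0, len(values), max_len):
        # Slicing clamps the chunk itself, but the reported end index is not clamped.
        yield values[start:start + max_len], start, start + max_len


for chunk, istart, iend in chunkify_list(list(range(10)), 3):
    print(chunk, istart, iend)
# [0, 1, 2] 0 3
# [3, 4, 5] 3 6
# [6, 7, 8] 6 9
# [9] 9 12
```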
- get_memory(a: torch.Tensor) int [source]¶
Returns the size of the tensor in bytes.
Examples
>>> import torch
>>> from deepchem.utils.pytorch_utils import get_memory
>>> a = torch.randn(100, 100, dtype=torch.float64)
>>> get_memory(a)
80000
- Parameters
a (torch.Tensor) – The tensor to be measured.
- Returns
size – The size of the tensor in bytes.
- Return type
int
Batch Utilities¶
The utilities here are used for computing features on a batch of data. They can be used inside the default_generator function.
- batch_coulomb_matrix_features(X_b: numpy.ndarray, distance_max: float = - 1, distance_min: float = 18, n_distance: int = 100)[source]¶
Computes the values of different features for a given batch. It works as a helper function for the Coulomb matrix.
This function takes in a batch of molecules represented as Coulomb matrices.
It proceeds as follows:
It calculates the number of atoms per molecule by counting all the non-zero elements of every molecule layer in the matrix along one dimension.
The Gaussian distance is calculated using the Euclidean distance between the Cartesian coordinates of two atoms. The distance value is then passed through a Gaussian function, which transforms it into a continuous value.
Then, using the number of atoms per molecule, it calculates the atomic charge by looping over the molecule layers in the Coulomb matrix and taking the 2.4th root of twice the diagonal of each molecule layer, undoing the Coulomb matrix equation.
Atom membership is assigned as a common repeating integer for all the atoms of a specific molecule.
Distance membership encodes spatial information, assigning closer values to atoms that are in the same molecule. A start value, unique to each molecule, is added to all initial distances.
Models Used in:
DTNN
- Parameters
X_b (np.ndarray) – A 3D matrix containing information about each atom’s ionic interaction with the other atoms in the molecule.
distance_min (float (default -1)) – Minimum distance of atom pairs (in Angstrom).
distance_max (float (default 18)) – Maximum distance of atom pairs (in Angstrom).
n_distance (int (default 100)) – Granularity of the distance matrix. The step size will be (distance_max - distance_min) / n_distance.
- Returns
atom_number (np.ndarray) – Atom numbers are assigned to each atom based on their atomic properties. The atomic numbers are derived from the periodic table of elements. For example, hydrogen -> 1, carbon -> 6, and oxygen -> 8.
gaussian_dist (np.ndarray) – Gaussian distance refers to the method of representing the pairwise distances between atoms in a molecule using Gaussian functions. The Gaussian distance is calculated using the Euclidean distance between the Cartesian coordinates of two atoms. The distance value is then passed through a Gaussian function, which transforms it into a continuous value.
atom_mem (np.ndarray) – Atom membership refers to the binary representation of whether an atom belongs to a specific group or property within a molecule. It allows the model to incorporate domain-specific information and enhance its understanding of the molecule’s properties and interactions.
dist_mem_i (np.ndarray) – Distance membership i are utilized to encode spatial information and capture the influence of atom distances on the properties and interactions within a molecule. The inner membership function assigns higher values to atoms that are closer to the atoms’ interaction region, thereby emphasizing the impact of nearby atoms.
dist_mem_j (np.ndarray) – It captures the long-range effects and influences between atoms that are not in direct proximity but still contribute to the overall molecular properties. Distance membership j are utilized to encode spatial information and capture the influence of atom distances on the properties and interactions outside a molecule. The outer membership function assigns higher values to atoms that are farther to the atoms’ interaction region, thereby emphasizing the impact of farther atoms.
Examples
>>> import os
>>> import deepchem as dc
>>> current_dir = os.path.dirname(os.path.abspath(__file__))
>>> dataset_file = os.path.join(current_dir, 'test/assets/qm9_mini.sdf')
>>> TASKS = ["alpha", "homo"]
>>> loader = dc.data.SDFLoader(tasks=TASKS,
...                            featurizer=dc.feat.CoulombMatrix(29),
...                            sanitize=True)
>>> data = loader.create_dataset(dataset_file, shard_size=100)
>>> inputs = dc.utils.batch_utils.batch_coulomb_matrix_features(data.X)
References
- 1
Montavon, Grégoire, et al. “Learning invariant representations of molecules for atomization energy prediction.” Advances in neural information processing systems. 2012.
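The step that recovers atomic charges from the Coulomb matrix diagonal can be checked with a little arithmetic. In the standard Coulomb matrix definition the diagonal entry for an atom with atomic number Z is 0.5 * Z ** 2.4, so doubling it and taking the 2.4th root returns Z. The helper names below are hypothetical and serve only to illustrate that round trip.

```python
def coulomb_diagonal(atomic_number):
    """Diagonal Coulomb matrix entry for an atom: 0.5 * Z ** 2.4."""
    return 0.5 * atomic_number ** 2.4


def charge_from_diagonal(diagonal_value):
    """Invert the diagonal entry: Z = (2 * C_ii) ** (1 / 2.4)."""
    return (2.0 * diagonal_value) ** (1.0 / 2.4)


for z in (1, 6, 8):  # hydrogen, carbon, oxygen
    recovered = charge_from_diagonal(coulomb_diagonal(z))
    print(z, round(recovered))
# 1 1
# 6 6
# 8 8
```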
- batch_elements(elements: List[Any], batch_size: int)[source]¶
Combine elements into batches.
- Parameters
elements (List[Any]) – List of Elements to be combined into batches.
batch_size (int) – Batch size in which to divide.
- Returns
batch – List of Lists of elements divided into batches.
- Return type
List[Any]
Examples
>>> import deepchem as dc
>>> # Prepare Data
>>> inputs = [[i, i**2, i**3] for i in range(10)]
>>> # Run
>>> output = list(dc.utils.batch_utils.batch_elements(inputs, 3))
>>> len(output)
4
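Batching of this kind is commonly written as a small generator. The following sketch, with the hypothetical name `batch_elements_sketch`, shows one way to do it in plain Python and makes it visible why ten elements in batches of three give four batches. It is an illustration of the idea, not the library source.

```python
def batch_elements_sketch(elements, batch_size):
    """Yield consecutive lists of at most `batch_size` elements."""
    batch = []
    for element in elements:
        batch.append(element)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # The final batch may be shorter than batch_size.
        yield batch


inputs = [[i, i**2, i**3] for i in range(10)]
batches = list(batch_elements_sketch(inputs, 3))
print([len(b) for b in batches])  # [3, 3, 3, 1]
```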
- create_input_array(sequences: Collection, max_input_length: int, reverse_input: bool, batch_size: int, input_dict: Dict, end_mark: Any)[source]¶
Create the array describing the input sequences.
It creates an empty 2D matrix according to the batch size and max_length, then iteratively fills it with the key-values from the input dictionary.
Many NLP models like SeqToSeq take sentences as their inputs. We need to convert these sentences into numbers so that the model can do computation on them.
This function takes in a sentence and uses the input_dict dictionary to look up the numerical representation of each word/letter. It then makes a numpy array out of these values.
If the reverse_input is True, then the order of the input sequences is reversed before sending them into the encoder. This can improve performance when working with long sequences.
These values can be used to generate embeddings for further processing.
Models used in:
SeqToSeq
- Parameters
sequences (Collection) – List of sequences to be converted into input array.
reverse_input (bool) – If True, reverse the order of input sequences before sending them into the encoder. This can improve performance when working with long sequences.
batch_size (int) – Batch size of the input array.
input_dict (dict) – Dictionary containing the key-value pairs of input sequences.
end_mark (Any) – End mark for the input sequences.
- Returns
features – Numeric Representation of the given sequence according to input_dict.
- Return type
np.ndarray
Examples
>>> import deepchem as dc
>>> # Prepare Data
>>> inputs = [["a", "b"], ["b", "b", "b"]]
>>> input_dict = {"c": 0, "a": 1, "b": 2}
>>> # Inputs property
>>> max_length = max([len(x) for x in inputs])
>>> # Without reverse input
>>> output_1 = dc.utils.batch_utils.create_input_array(inputs, max_length,
...                                                    False, 2, input_dict,
...                                                    "c")
>>> output_1.shape
(2, 4)
>>> # With reverse input
>>> output_2 = dc.utils.batch_utils.create_input_array(inputs, max_length,
...                                                    True, 2, input_dict,
...                                                    "c")
>>> output_2.shape
(2, 4)
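A pure-Python sketch may help in seeing how tokens turn into a numeric grid. The function below is hypothetical and rests on several assumptions that the text does not confirm: unused cells are padded with 0, "reversing" means reversing the tokens inside each sequence, and the index of the end mark is written directly after the last token. These choices were made so that the result has the (2, 4) shape shown in the example.

```python
def create_input_array_sketch(sequences, max_input_length, reverse_input,
                              batch_size, input_dict, end_mark):
    """Encode token sequences as a (batch_size, max_input_length + 1) grid of ints."""
    features = [[0] * (max_input_length + 1) for _ in range(batch_size)]
    for i, sequence in enumerate(sequences):
        tokens = list(reversed(sequence)) if reverse_input else list(sequence)
        for j, token in enumerate(tokens):
            features[i][j] = input_dict[token]
        # Assumption: the end mark is written right after the last token.
        features[i][len(tokens)] = input_dict[end_mark]
    return features


inputs = [["a", "b"], ["b", "b", "b"]]
input_dict = {"c": 0, "a": 1, "b": 2}
max_length = max(len(x) for x in inputs)
print(create_input_array_sketch(inputs, max_length, False, 2, input_dict, "c"))
# [[1, 2, 0, 0], [2, 2, 2, 0]]
print(create_input_array_sketch(inputs, max_length, True, 2, input_dict, "c"))
# [[2, 1, 0, 0], [2, 2, 2, 0]]
```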
- create_output_array(sequences: Collection, max_output_length: int, batch_size: int, output_dict: Dict, end_mark: Any)[source]¶
Create the array describing the target sequences.
It creates an empty 2D matrix according to the batch size and max_length, then iteratively fills it with the key-values from the output dictionary.
This function is similar to the create_input_array function. The only difference is that it is used for output sequences and does not have the reverse_input parameter, as it is not required for output sequences.
It is used in NLP models like SeqToSeq where the output is also a sentence and we need to convert it into numbers so that the model can do computation on them. This function takes in a sentence and uses the output_dict dictionary to look up the numerical representation of each word/letter. It then makes a numpy array out of these values.
These values can be used to generate embeddings for further processing.
Models used in:
SeqToSeq
- Parameters
sequences (Collection) – List of sequences to be converted into output array.
max_output_length (int) – Maximum length of output sequence that may be generated
batch_size (int) – Batch size of the output array.
output_dict (dict) – Dictionary containing the key-value pairs of output sequences.
end_mark (Any) – End mark for the output sequences.
- Returns
features – Numeric Representation of the given sequence according to output_dict.
- Return type
np.ndarray
Examples
>>> import deepchem as dc
>>> # Prepare Data
>>> inputs = [["a", "b"], ["b", "b", "b"]]
>>> output_dict = {"c": 0, "a": 1, "b": 2}
>>> # Inputs property
>>> max_length = max([len(x) for x in inputs])
>>> output = dc.utils.batch_utils.create_output_array(inputs, max_length, 2,
...                                                   output_dict, "c")
>>> output.shape
(2, 3)
Periodic Table Utilities¶
The utilities here are used to compute atomic mass and radii data. These can be used by DFT and many other molecular models.
Equivariance Utilities¶
The utilities here refer to equivariance tools that play a vital role in mathematics and applied sciences. They excel in preserving the relationships between objects or data points when undergoing transformations such as rotations or scaling.
You can refer to the tutorials for additional information regarding equivariance and DeepChem’s support for equivariance.
- su2_generators(k: int) torch.Tensor [source]¶
Generate the generators of the special unitary group SU(2) in a given representation.
The function computes the generators of the SU(2) group for a specific representation determined by the value of ‘k’. These generators are commonly used in the study of quantum mechanics, angular momentum, and related areas of physics and mathematics. The generators are represented as matrices.
The SU(2) group is a fundamental concept in quantum mechanics and symmetry theory. The generators of the group, denoted as J_x, J_y, and J_z, represent the three components of angular momentum operators. These generators play a key role in describing the transformation properties of physical systems under rotations.
The returned tensor contains three matrices corresponding to the x, y, and z generators, usually denoted as J_x, J_y, and J_z. These matrices form a basis for the Lie algebra of the SU(2) group.
In linear algebra, specifically within the context of quantum mechanics, lowering and raising operators are fundamental concepts that play a crucial role in altering the eigenvalues of certain operators while acting on quantum states. These operators are often referred to collectively as “ladder operators.”
A lowering operator is an operator that, when applied to a quantum state, reduces the eigenvalue associated with a particular observable. In the context of SU(2), the lowering operator corresponds to J_-.
Conversely, a raising operator is an operator that increases the eigenvalue of an observable when applied to a quantum state. In the context of SU(2), the raising operator corresponds to J_+.
The z-generator matrix represents the component of angular momentum along the z-axis, often denoted as J_z. It commutes with both J_x and J_y and is responsible for quantizing the angular momentum.
Note that the dimensions of the returned tensor will be (3, 2j+1, 2j+1), where each matrix has a size of (2j+1) x (2j+1).
- Parameters
k (int) – The representation index, which determines the order of the representation.
- Returns
A stack of three SU(2) generators, corresponding to J_x, J_z, and J_y.
- Return type
torch.Tensor
Notes
A generating set of a group is a subset $S$ of the group $G$ such that every element of $G$ can be expressed as a combination (under the group operation) of finitely many elements of the subset $S$ and their inverses.
The special unitary group $SU_n(q)$ is the set of $n \times n$ unitary matrices with determinant +1. $SU(2)$ is homeomorphic with the orthogonal group $O_3^+(2)$. It is also called the unitary unimodular group and is a Lie group.
Examples
>>> su2_generators(1)
tensor([[[ 0.0000+0.0000j,  0.7071+0.0000j,  0.0000+0.0000j],
         [-0.7071+0.0000j,  0.0000+0.0000j,  0.7071+0.0000j],
         [ 0.0000+0.0000j, -0.7071+0.0000j,  0.0000+0.0000j]],
        [[-0.0000-1.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
         [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
         [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+1.0000j]],
        [[ 0.0000-0.0000j,  0.0000+0.7071j,  0.0000-0.0000j],
         [ 0.0000+0.7071j,  0.0000-0.0000j,  0.0000+0.7071j],
         [ 0.0000-0.0000j,  0.0000+0.7071j,  0.0000-0.0000j]]])
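For reference, the textbook relations between the ladder operators and the angular momentum components discussed in this entry can be written as follows, in units where $\hbar = 1$. This is the standard physics convention and is given only as background. The matrices returned by the function may follow a different sign or phase convention (the example output above is not Hermitian), so these relations should not be read as a statement about the exact values the function returns.

```latex
\begin{aligned}
J_{\pm} &= J_x \pm i J_y, \\
J_z \,|j, m\rangle &= m \,|j, m\rangle, \\
J_{\pm} \,|j, m\rangle &= \sqrt{j(j+1) - m(m \pm 1)} \;|j, m \pm 1\rangle, \\
[J_x, J_y] &= i J_z \quad \text{(and cyclic permutations)}.
\end{aligned}
```

Here $m$ runs over $-j, -j+1, \dots, j$, which gives the $2j+1$ basis states and hence the $(2j+1) \times (2j+1)$ matrix size mentioned in this entry.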
- so3_generators(k: int) torch.Tensor [source]¶
Construct the generators of the SO(3) Lie algebra for a given quantum angular momentum.
The function generates the generators of the special orthogonal group SO(3), which represents the group of rotations in three-dimensional space. Its Lie algebra, which consists of the generators of infinitesimal rotations, is often used in physics to describe angular momentum operators. The generators of the Lie algebra can be related to the SU(2) group, and this function uses a transformation to convert the SU(2) generators to the SO(3) basis.
The primary significance of the SO(3) group lies in its representation of three-dimensional rotations. Each matrix in SO(3) corresponds to a unique rotation, capturing the intricate ways in which objects can be oriented in 3D space. This concept finds application in numerous fields, ranging from physics to engineering.
- Parameters
k (int) – The representation index, which determines the order of the representation.
- Returns
A stack of three SO(3) generators, corresponding to J_x, J_z, and J_y.
- Return type
torch.Tensor
Notes
The special orthogonal group $SO_n(q)$ is the subgroup of the elements of general orthogonal group $GO_n(q)$ with determinant 1. $SO_3$ (often written $SO(3)$) is the rotation group for three-dimensional space.
These matrices are orthogonal, which means their rows and columns form mutually perpendicular unit vectors. This preservation of angles and lengths makes orthogonal matrices fundamental in various mathematical and practical applications.
The “special” part of $SO(3)$ refers to the determinant of these matrices being $+1$. The determinant is a scalar value that indicates how much a matrix scales volumes. A determinant of $+1$ ensures that the matrix represents a rotation in three-dimensional space without involving any reflection or scaling operations that would reverse the orientation of space.
References
- 1
- 2
https://en.wikipedia.org/wiki/3D_rotation_group#Connection_between_SO(3)_and_SU(2)
- 3
https://www.pas.rochester.edu/assets/pdf/undergraduate/su-2s_double_covering_of_so-3.pdf
Examples
>>> so3_generators(1)
tensor([[[ 0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000, -1.0000],
         [ 0.0000,  1.0000,  0.0000]],

        [[ 0.0000,  0.0000,  1.0000],
         [ 0.0000,  0.0000,  0.0000],
         [-1.0000,  0.0000,  0.0000]],

        [[ 0.0000, -1.0000,  0.0000],
         [ 1.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000]]])
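To illustrate the SU(2) to SO(3) conversion described above, here is a minimal sketch in plain Python. It assumes the transformation is the conjugation Q†XQ, where Q is the matrix printed by change_basis_real_to_complex(1). That form is an assumption, not confirmed by this page, although it is consistent with the printed examples for k = 1. All matrices are transcribed from the example outputs, and the helper names are illustrative.

```python
import math

def matmul(a, b):
    """Multiply two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def dagger(a):
    """Conjugate transpose of a square matrix."""
    n = len(a)
    return [[complex(a[j][i]).conjugate() for j in range(n)] for i in range(n)]

def max_abs_diff(a, b):
    """Largest entrywise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

s = 1 / math.sqrt(2)  # printed as 0.7071 in the examples

# SU(2) generators for k = 1, transcribed from the su2_generators(1) example.
su2 = [
    [[0, s, 0], [-s, 0, s], [0, -s, 0]],
    [[-1j, 0, 0], [0, 0, 0], [0, 0, 1j]],
    [[0, s * 1j, 0], [s * 1j, 0, s * 1j], [0, s * 1j, 0]],
]

# Q transcribed from the change_basis_real_to_complex(1) example.
Q = [[-s, 0, -s * 1j],
     [0, -1j, 0],
     [-s, 0, s * 1j]]

# SO(3) generators transcribed from the so3_generators(1) example.
so3_printed = [
    [[0, 0, 0], [0, 0, -1], [0, 1, 0]],
    [[0, 0, 1], [0, 0, 0], [-1, 0, 0]],
    [[0, -1, 0], [1, 0, 0], [0, 0, 0]],
]

# Assumed conjugation: Q^dagger X Q for each SU(2) generator X.
so3_sketch = [matmul(dagger(Q), matmul(X, Q)) for X in su2]
conjugation_error = max(max_abs_diff(a, b)
                        for a, b in zip(so3_sketch, so3_printed))

# Commutator [G0, G1] of the printed real generators.
g0, g1, g2 = so3_printed
g0g1, g1g0 = matmul(g0, g1), matmul(g1, g0)
commutator = [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(g0g1, g1g0)]

print(conjugation_error)   # floating-point noise only
print(commutator == g2)    # True
```

The commutator check shows that the printed matrices close under the Lie bracket, with the first two generators producing the third.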
- change_basis_real_to_complex(k: int, dtype: Optional[torch.dtype] = None, device: Optional[torch.device] = None) torch.Tensor [source]¶
Construct a transformation matrix to change the basis from real to complex spherical harmonics.
This function constructs a transformation matrix Q that converts real spherical harmonics into complex spherical harmonics. It operates on the basis functions $Y_{\ell m}$ and $Y_{\ell}^{m}$, and accounts for the relationship between the real and complex forms of these harmonics as defined in the provided mathematical expressions.
The resulting transformation matrix Q is used to change the basis of vectors or tensors of real spherical harmonics to their complex counterparts.
- Parameters
k (int) – The representation index, which determines the order of the representation.
dtype (torch.dtype, optional) – The data type for the output tensor. If not provided, the function will infer it. Default is None.
device (torch.device, optional) – The device where the output tensor will be placed. If not provided, the function will use the default device. Default is None.
- Returns
A transformation matrix Q that changes the basis from real to complex spherical harmonics.
- Return type
torch.Tensor
Notes
Spherical harmonics Y_l^m are a family of functions that are defined on the surface of a unit sphere. They are used to represent various physical and mathematical phenomena that exhibit spherical symmetry. The indices l and m represent the degree and order of the spherical harmonics, respectively.
The conversion from real to complex spherical harmonics is achieved by applying specific transformation coefficients to the real-valued harmonics. These coefficients are derived from the properties of spherical harmonics.
Examples
# The transformation matrix generated is used to change the basis of a vector of
# real spherical harmonics with representation index 1 to complex spherical harmonics.
>>> change_basis_real_to_complex(1)
tensor([[-0.7071+0.0000j,  0.0000+0.0000j,  0.0000-0.7071j],
        [ 0.0000+0.0000j,  0.0000-1.0000j,  0.0000+0.0000j],
        [-0.7071+0.0000j,  0.0000+0.0000j,  0.0000+0.7071j]])
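A change of basis between orthonormal sets should be unitary. As a quick, hedged check, the sketch below transcribes Q from the example output for k = 1 (with 0.7071 taken to be 1/√2, an assumption about the rounded printout) and verifies in plain Python that Q†Q is the identity.

```python
import math

s = 1 / math.sqrt(2)  # assumed exact value behind the printed 0.7071

# Q transcribed from the change_basis_real_to_complex(1) example output.
Q = [[-s, 0, -s * 1j],
     [0, -1j, 0],
     [-s, 0, s * 1j]]

n = len(Q)

# (Q^dagger Q)[i][j] = sum over k of conj(Q[k][i]) * Q[k][j]
gram = [[sum(complex(Q[k][i]).conjugate() * Q[k][j] for k in range(n))
         for j in range(n)]
        for i in range(n)]

# Largest deviation of Q^dagger Q from the identity matrix.
unitarity_error = max(abs(gram[i][j] - (1 if i == j else 0))
                      for i in range(n) for j in range(n))

print(unitarity_error)  # floating-point noise only
```

Because Q is unitary, its inverse is simply its conjugate transpose, which gives the transformation in the opposite direction.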
- wigner_D(k: int, alpha: torch.Tensor, beta: torch.Tensor, gamma: torch.Tensor) torch.Tensor [source]¶
Wigner D matrix representation of the SO(3) rotation group.
This function computes the Wigner D matrix representation of the SO(3) rotation group for a given representation index ‘k’ and rotation angles ‘alpha’, ‘beta’, and ‘gamma’. The resulting matrices form a representation of SO(3), so composing rotations corresponds to multiplying their matrices.
- Parameters
k (int) – The representation index, which determines the order of the representation.
alpha (torch.Tensor) – Rotation angles (in radians) around the Y axis, applied third.
beta (torch.Tensor) – Rotation angles (in radians) around the X axis, applied second.
gamma (torch.Tensor) – Rotation angles (in radians) around the Y axis, applied first.
- Returns
The Wigner D matrix of shape (#angles, 2k+1, 2k+1).
- Return type
torch.Tensor
Notes
The Wigner D-matrix is a unitary matrix in an irreducible representation of the groups SU(2) and SO(3).
The Wigner D-matrix is used in quantum mechanics to describe the action of rotations on states of particles with angular momentum. It is a key concept in the representation theory of the rotation group SO(3), and it plays a crucial role in various physical contexts.
Examples
>>> import torch
>>> k = 1
>>> alpha = torch.tensor([0.1, 0.2])
>>> beta = torch.tensor([0.3, 0.4])
>>> gamma = torch.tensor([0.5, 0.6])
>>> wigner_D_matrix = wigner_D(k, alpha, beta, gamma)
>>> wigner_D_matrix
tensor([[[ 0.8275,  0.0295,  0.5607],
         [ 0.1417,  0.9553, -0.2593],
         [-0.5433,  0.2940,  0.7863]],

        [[ 0.7056,  0.0774,  0.7044],
         [ 0.2199,  0.9211, -0.3214],
         [-0.6737,  0.3816,  0.6329]]])
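For k = 1 the example output can be reproduced by hand, which helps make the angle convention concrete. The sketch below is plain Python and not DeepChem's implementation. It assumes the matrix is the product R_Y(alpha) · R_X(beta) · R_Y(gamma), following the axis order in the parameter list, where R_Y and R_X are the closed-form exponentials of the second and first matrices printed by so3_generators(1). The function name `wigner_d1_sketch` is hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def rot_y(t):
    """exp(t * G1), with G1 the second matrix from so3_generators(1)."""
    c, s = math.cos(t), math.sin(t)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def rot_x(t):
    """exp(t * G0), with G0 the first matrix from so3_generators(1)."""
    c, s = math.cos(t), math.sin(t)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def wigner_d1_sketch(alpha, beta, gamma):
    """Assumed k = 1 composition: gamma is applied first, alpha last."""
    return matmul(rot_y(alpha), matmul(rot_x(beta), rot_y(gamma)))

# Values printed in the example above, rounded to four decimals.
printed = [
    [[0.8275, 0.0295, 0.5607], [0.1417, 0.9553, -0.2593], [-0.5433, 0.2940, 0.7863]],
    [[0.7056, 0.0774, 0.7044], [0.2199, 0.9211, -0.3214], [-0.6737, 0.3816, 0.6329]],
]
angles = [(0.1, 0.3, 0.5), (0.2, 0.4, 0.6)]

# Largest entrywise deviation between the sketch and the printed values.
max_deviation = 0.0
for (a, b, g), expected in zip(angles, printed):
    d = wigner_d1_sketch(a, b, g)
    for row_d, row_e in zip(d, expected):
        for x, y in zip(row_d, row_e):
            max_deviation = max(max_deviation, abs(x - y))

print(max_deviation)  # on the order of the four-decimal rounding in the printout
```

The middle entry of each matrix equals cos(beta), which is a quick way to see that beta is the angle of the middle rotation.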