Featurizers

DeepChem contains an extensive collection of featurizers. If you haven’t run into this terminology before, a “featurizer” is a chunk of code that transforms raw input data into a processed form suitable for machine learning. Machine learning methods often need data to be pre-chewed before they can process it. Think of this like a mama penguin chewing up food so the baby penguin can digest it easily.

Now if you’ve watched a few introductory deep learning lectures, you might ask, why do we need something like a featurizer? Isn’t part of the promise of deep learning that we can learn patterns directly from raw data?

Unfortunately, it turns out that deep learning techniques need featurizers just like normal machine learning methods do. Arguably, they are less dependent on sophisticated featurizers and more capable of learning sophisticated patterns from simpler data. Nevertheless, deep learning systems can’t simply chew up raw files. For this reason, DeepChem provides an extensive collection of featurization methods, which we review on this page.

Featurizer

The dc.feat.Featurizer class is the abstract parent class for all featurizers.

class deepchem.feat.Featurizer[source]

Abstract class for calculating a set of features for a datapoint.

This class is abstract and cannot be invoked directly. You’ll likely only interact with this class if you’re a developer. In that case, if you’d like to make a featurizer for a new datatype, you might want to make a child class which implements the _featurize method for calculating features for a single datapoint.

__init__

Initialize self. See help(type(self)) for accurate signature.

featurize(datapoints, log_every_n=1000)[source]

Calculate features for datapoints.

Parameters:datapoints (iterable) – A sequence of objects that you’d like to featurize. Subclasses of Featurizer should implement the _featurize method that featurizes objects in the sequence.
Returns:A numpy array containing a featurized representation of datapoints.
Return type:np.ndarray
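
The subclassing pattern described above can be sketched without importing DeepChem. The following is a minimal stand-in that assumes only what the text states: the parent exposes featurize, which loops over the datapoints and delegates each one to a child-defined _featurize. The class names SketchFeaturizer and LengthFeaturizer are hypothetical, and the sketch returns a plain list where the real method returns a numpy array.

```python
class SketchFeaturizer:
    """Hypothetical stand-in for an abstract featurizer parent class."""

    def featurize(self, datapoints):
        # Delegate each datapoint to the child's _featurize method.
        return [self._featurize(point) for point in datapoints]

    def _featurize(self, datapoint):
        raise NotImplementedError("Child classes must implement _featurize.")


class LengthFeaturizer(SketchFeaturizer):
    """Hypothetical child class: featurizes a string by its length."""

    def _featurize(self, datapoint):
        return [len(datapoint)]


features = LengthFeaturizer().featurize(["CCO", "c1ccccc1"])
print(features)  # [[3], [8]]
```

Calling featurize on the parent itself raises NotImplementedError, which mirrors the statement that the abstract class cannot be used directly.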

MolecularFeaturizer

Molecular Featurizers are those that work with datasets of molecules.

class deepchem.feat.MolecularFeaturizer[source]

Abstract class for calculating a set of features for a molecule.

The defining feature of a MolecularFeaturizer is that it uses SMILES strings and RDKit molecule objects to represent small molecules. All other featurizers which are subclasses of this class should plan to process input which comes as SMILES strings or RDKit molecules.

Child classes need to implement the _featurize method for calculating features for a single molecule.

Note

In general, subclasses of this class will require RDKit to be installed.

__init__

Initialize self. See help(type(self)) for accurate signature.

featurize(molecules, log_every_n=1000)[source]

Calculate features for molecules.

Parameters:molecules (RDKit Mol / SMILES string / iterable) – RDKit Mol, or SMILES string, or iterable sequence of RDKit mols/SMILES strings.
Returns:A numpy array containing a featurized representation of datapoints.

Here are some constants that are used by the graph convolutional featurizers for molecules.

class deepchem.feat.graph_features.GraphConvConstants[source]

This class defines a collection of constants which are useful for graph convolutions on molecules.

__init__

Initialize self. See help(type(self)) for accurate signature.

bond_fdim_base = 6

Number of different bond types not counting stereochemistry.

intervals = [1, 6, 48, 384, 1536, 9216, 27648]

The number of different values that can be taken. See get_intervals().

possible_atom_list = ['C', 'N', 'O', 'S', 'F', 'P', 'Cl', 'Mg', 'Na', 'Br', 'Fe', 'Ca', 'Cu', 'Mc', 'Pd', 'Pb', 'K', 'I', 'Al', 'Ni', 'Mn']

possible_bond_stereo = ['STEREONONE', 'STEREOANY', 'STEREOZ', 'STEREOE']

Possible stereochemistry. We use E-Z notation for stereochemistry https://en.wikipedia.org/wiki/E%E2%80%93Z_notation

possible_chirality_list = ['R', 'S']

Allowed types of chirality.

possible_formal_charge_list = [-3, -2, -1, 0, 1, 2, 3]

Allowed formal charges for atoms.

possible_hybridization_list = ['SP', 'SP2', 'SP3', 'SP3D', 'SP3D2']

This is a placeholder for documentation. These will be replaced with corresponding values of the RDKit HybridizationType.

possible_numH_list = [0, 1, 2, 3, 4]

Allowed numbers of hydrogens.

possible_number_radical_e_list = [0, 1, 2]

Allowed number of radical electrons.

possible_valence_list = [0, 1, 2, 3, 4, 5, 6]

Allowed valences for atoms.

reference_lists = [['C', 'N', 'O', 'S', 'F', 'P', 'Cl', 'Mg', 'Na', 'Br', 'Fe', 'Ca', 'Cu', 'Mc', 'Pd', 'Pb', 'K', 'I', 'Al', 'Ni', 'Mn'], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4, 5, 6], [-3, -2, -1, 0, 1, 2, 3], [0, 1, 2], ['SP', 'SP2', 'SP3', 'SP3D', 'SP3D2'], ['R', 'S']]

The set of all values allowed.

There are a number of helper methods used by the graph convolutional classes which we document here.

deepchem.feat.graph_features.one_of_k_encoding(x, allowable_set)[source]

One-hot encodes an element of a provided set as a list of booleans.

Parameters:
  • x (object) – Must be present in allowable_set.
  • allowable_set (list) – List of allowable quantities.

Example

>>> import deepchem as dc
>>> dc.feat.graph_features.one_of_k_encoding("a", ["a", "b", "c"])
[True, False, False]
Raises:ValueError if x is not in allowable_set.

deepchem.feat.graph_features.one_of_k_encoding_unk(x, allowable_set)[source]

Maps inputs not in the allowable set to the last element.

Unlike one_of_k_encoding, if x is not in allowable_set, this method pretends that x is the last element of allowable_set.

Parameters:
  • x (object) – Value to encode. If it is not present in allowable_set, it is treated as the last element of allowable_set.
  • allowable_set (list) – List of allowable quantities.

Examples

>>> dc.feat.graph_features.one_of_k_encoding_unk("s", ["a", "b", "c"])
[False, False, True]
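
The difference between the two encoders is easiest to see side by side. The following plain-Python sketch reproduces the documented examples; it is written from the descriptions above rather than copied from the library, and the _sketch suffixes mark the function names as hypothetical.

```python
def one_of_k_encoding_sketch(x, allowable_set):
    # Strict variant: reject values outside the allowable set.
    if x not in allowable_set:
        raise ValueError(f"input {x!r} not in allowable set {allowable_set!r}")
    return [x == s for s in allowable_set]


def one_of_k_encoding_unk_sketch(x, allowable_set):
    # Lenient variant: treat unknown values as the last element.
    if x not in allowable_set:
        x = allowable_set[-1]
    return [x == s for s in allowable_set]


print(one_of_k_encoding_sketch("a", ["a", "b", "c"]))      # [True, False, False]
print(one_of_k_encoding_unk_sketch("s", ["a", "b", "c"]))  # [False, False, True]
```

With the lenient variant, the last slot of allowable_set effectively doubles as an "unknown" bucket, so an unknown input and the real last element produce the same encoding.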
deepchem.feat.graph_features.get_intervals(l)[source]

For list of lists, gets the cumulative products of the lengths

Note that we add 1 to the lengths of all lists (to avoid an empty list propagating a 0).

Parameters:l (list of lists) – The lists whose lengths are used to compute the cumulative products.

Examples

>>> dc.feat.graph_features.get_intervals([[1], [1, 2], [1, 2, 3]])
[1, 3, 12]
>>> dc.feat.graph_features.get_intervals([[1], [], [1, 2], [1, 2, 3]])
[1, 1, 3, 12]
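
The two examples above pin down the recurrence fairly tightly: the first entry is 1, and each later entry is the previous entry times one more than the length of the current list. A minimal sketch under that reading (the name get_intervals_sketch is hypothetical, and the library's own code may be organized differently):

```python
def get_intervals_sketch(lists):
    # The first entry is always 1; each later entry multiplies the previous
    # one by (length of the current list + 1).
    intervals = [1] * len(lists)
    for k in range(1, len(lists)):
        intervals[k] = (len(lists[k]) + 1) * intervals[k - 1]
    return intervals


print(get_intervals_sketch([[1], [1, 2], [1, 2, 3]]))      # [1, 3, 12]
print(get_intervals_sketch([[1], [], [1, 2], [1, 2, 3]]))  # [1, 1, 3, 12]
```

The second call shows why 1 is added to each length: the empty list contributes a factor of 1 instead of zeroing out every later entry.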
deepchem.feat.graph_features.safe_index(l, e)[source]

Gets the index of e in l, providing an index of len(l) if not found

Parameters:
  • l (list) – List of values
  • e (object) – Object to look up in l

Examples

>>> dc.feat.graph_features.safe_index([1, 2, 3], 1)
0
>>> dc.feat.graph_features.safe_index([1, 2, 3], 7)
3
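
The behavior shown in these examples amounts to a list lookup with a fallback. A short sketch, with the hypothetical name safe_index_sketch:

```python
def safe_index_sketch(values, element):
    # Return the position of element, or len(values) when it is absent.
    try:
        return values.index(element)
    except ValueError:
        return len(values)


print(safe_index_sketch([1, 2, 3], 1))  # 0
print(safe_index_sketch([1, 2, 3], 7))  # 3
```

Because the fallback index is one past the end of the list, a list of n allowed values can yield n + 1 distinct indices.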
deepchem.feat.graph_features.get_feature_list(atom)[source]

Returns a list of possible features for this atom.

Parameters:atom (RDKit.rdchem.Atom) – Atom to get features for

Examples

>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles("C")
>>> atom = mol.GetAtoms()[0]
>>> dc.feat.graph_features.get_feature_list(atom)
[0, 4, 4, 3, 0, 2]

Note

This method requires RDKit to be installed.

Returns:features – List of length 6. The i-th value in this list provides the index of the atom’s value in the corresponding feature value list. The 6 feature value lists for this function are [GraphConvConstants.possible_atom_list, GraphConvConstants.possible_numH_list, GraphConvConstants.possible_valence_list, GraphConvConstants.possible_formal_charge_list, GraphConvConstants.possible_number_radical_e_list, GraphConvConstants.possible_hybridization_list].
Return type:list

deepchem.feat.graph_features.features_to_id(features, intervals)[source]

Convert list of features into index using spacings provided in intervals

Parameters:
  • features (list) – List of features as returned by get_feature_list()
  • intervals (list) – List of intervals as returned by get_intervals()
Returns:id – The index in a feature vector given by the given set of features.
Return type:int

deepchem.feat.graph_features.id_to_features(id, intervals)[source]

Given an index in a feature vector, return the original set of features.

Parameters:
  • id (int) – The index in a feature vector given by the given set of features.
  • intervals (list) – List of intervals as returned by get_intervals()
Returns:features – List of features as returned by get_feature_list()
Return type:list
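
One way to read "index using spacings provided in intervals" is as a weighted sum, with the inverse recovered by repeated division. The sketch below illustrates that idea only; it is not the library's code, which may for example apply an offset to the id, and the function names are hypothetical.

```python
def features_to_id_sketch(features, intervals):
    # Weighted sum: feature k contributes features[k] * intervals[k].
    return sum(f * step for f, step in zip(features, intervals))


def id_to_features_sketch(index, intervals):
    # Peel off the largest spacing first, then work down to the smallest.
    features = [0] * len(intervals)
    for k in range(len(intervals) - 1, -1, -1):
        features[k], index = divmod(index, intervals[k])
    return features


intervals = [1, 6, 48, 384, 1536, 9216]
features = [0, 4, 4, 3, 0, 2]
index = features_to_id_sketch(features, intervals)
print(index)                                    # 19800
print(id_to_features_sketch(index, intervals))  # [0, 4, 4, 3, 0, 2]
```

In this sketch the round trip is exact only when the lower-order terms sum to less than the next spacing; larger feature values would collide with a neighboring slot.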

deepchem.feat.graph_features.atom_to_id(atom)[source]

Return a unique id corresponding to the atom type

Parameters:atom (RDKit.rdchem.Atom) – Atom to convert to ids.
Returns:id – The index in a feature vector given by the given set of features.
Return type:int

This function helps compute distances between atoms from a given base atom.

deepchem.feat.graph_features.find_distance(a1, num_atoms, canon_adj_list, max_distance=7)[source]

Computes distances from provided atom.

Parameters:
  • a1 (RDKit atom) – The source atom to compute distances from.
  • num_atoms (int) – The total number of atoms.
  • canon_adj_list (list of lists) – canon_adj_list[i] is a list of the atom indices that atom i shares a bond with. This list is symmetrical so if j in canon_adj_list[i] then i in canon_adj_list[j].
  • max_distance (int, optional (default 7)) – The max distance to search.
Returns:distances – Of shape (num_atoms, max_distance). Provides a one-hot encoding of the distances. That is, distances[i] is a one-hot encoding of the distance from a1 to atom i.
Return type:np.ndarray
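
A one-hot distance table of this kind can be built with a breadth-first search over the adjacency list. The sketch below uses nested Python lists instead of a numpy array and makes two assumptions that the text does not confirm: the source atom is given as an integer index, and column d - 1 marks atoms exactly d bonds away. The function name is hypothetical.

```python
def one_hot_distances_sketch(source, num_atoms, adj_list, max_distance=7):
    # distances[i][d - 1] == 1 when atom i is exactly d bonds from source.
    distances = [[0] * max_distance for _ in range(num_atoms)]
    visited = {source}
    frontier = set(adj_list[source])
    for radial in range(max_distance):
        for atom in frontier:
            distances[atom][radial] = 1
        visited |= frontier
        # Atoms one bond further out that have not been reached yet.
        next_frontier = set()
        for atom in frontier:
            next_frontier.update(adj_list[atom])
        frontier = next_frontier - visited
    return distances


# Linear chain 0 - 1 - 2, searched from atom 0.
chain = [[1], [0, 2], [1]]
print(one_hot_distances_sketch(0, 3, chain, max_distance=2))
# [[0, 0], [1, 0], [0, 1]]
```

Under these assumptions the row for the source atom stays all zeros, as do rows for atoms farther away than max_distance.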

This function is important and computes per-atom feature vectors used by graph convolutional featurizers.

deepchem.feat.graph_features.atom_features(atom, bool_id_feat=False, explicit_H=False, use_chirality=False)[source]

Helper method used to compute per-atom feature vectors.

Many different featurization methods, such as ConvMolFeaturizer and WeaveFeaturizer, compute per-atom features. This method computes such features.

Parameters:
  • bool_id_feat (bool, optional) – Return an array of unique identifiers corresponding to atom type.
  • explicit_H (bool, optional) – If true, model hydrogens explicitly
  • use_chirality (bool, optional) – If true, use chirality information.
Returns:Array of per-atom features.
Return type:np.ndarray

This function computes the bond features used by graph convolutional featurizers.

deepchem.feat.graph_features.bond_features(bond, use_chirality=False)[source]

Helper method used to compute bond feature vectors.

Many different featurization methods, such as WeaveFeaturizer, compute bond features. This method computes such features.

Parameters:use_chirality (bool, optional) – If true, use chirality information.

Note

This method requires RDKit to be installed.

Returns:bond_feats – Array of bond features. This is a 1-D array of length 6 if use_chirality is False else of length 10 with chirality encoded.
Return type:np.ndarray

This function computes atom-atom features (for atom pairs which may not have bonds between them.)

deepchem.feat.graph_features.pair_features(mol, edge_list, canon_adj_list, bt_len=6, graph_distance=True)[source]

Helper method used to compute atom pair feature vectors.

Many different featurization methods, such as WeaveFeaturizer, compute atom pair features. Note that atom pair features could be for pairs of atoms which aren’t necessarily bonded to one another.

Parameters:
  • mol (RDKit Mol) – Molecule to compute features on.
  • edge_list (list) – List of edges to consider
  • canon_adj_list (list of lists) – canon_adj_list[i] is a list of the atom indices that atom i shares a bond with. This list is symmetrical so if j in canon_adj_list[i] then i in canon_adj_list[j].
  • bt_len (int, optional (default 6)) – The number of different bond types to consider.
  • graph_distance (bool, optional (default True)) – If true, use graph distance between atoms. Else use Euclidean distance.

Note

This method requires RDKit to be installed.

Returns:features – Of shape (N, N, bt_len + max_distance + 1). This is the array of pairwise features for all atom pairs.
Return type:np.ndarray

ConvMolFeaturizer

class deepchem.feat.ConvMolFeaturizer(master_atom=False, use_chirality=False, atom_properties=[])[source]

This class implements the featurization used by Duvenaud graph convolutions.

Duvenaud graph convolutions [1] construct a vector of descriptors for each atom in a molecule. The featurizer computes that vector of local descriptors.

References

[1]Duvenaud, David K., et al. “Convolutional networks on graphs for learning molecular fingerprints.” Advances in neural information processing systems. 2015.

Note

This class requires RDKit to be installed.

__init__(master_atom=False, use_chirality=False, atom_properties=[])[source]
Parameters:
  • master_atom (Boolean) – If true, create a fake atom with bonds to every other atom. Its initialization is the mean of the other atom features in the molecule. This technique is briefly discussed in Neural Message Passing for Quantum Chemistry https://arxiv.org/pdf/1704.01212.pdf
  • use_chirality (Boolean) – If true, make the resulting atom features aware of the chirality of the molecules in question.
  • atom_properties (list of string or None) – Properties in the RDKit Mol object to use as additional atom-level features in the larger molecular feature. If None, then no atom-level properties are used. Properties in the RDKit Mol object should be named in the form atom XXXXXXXX NAME, where XXXXXXXX is a zero-padded 8-digit number corresponding to the zero-indexed atom index of each atom and NAME is the name of the property provided in atom_properties. So “atom 00000000 sasa” would be the name of the molecule-level property in mol where the solvent accessible surface area of atom 0 would be stored.

Note: Since ConvMol is an object and not a numpy array, the dtype needs to be set to object.
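
The property naming scheme described for atom_properties can be produced with ordinary string formatting. A small sketch, where atom_property_key is a hypothetical helper and not a DeepChem function:

```python
def atom_property_key(atom_index, name):
    # Zero-pad the zero-indexed atom index to 8 digits.
    return f"atom {atom_index:08d} {name}"


print(atom_property_key(0, "sasa"))   # atom 00000000 sasa
print(atom_property_key(12, "sasa"))  # atom 00000012 sasa
```

A key built this way would be the name under which the per-atom value is stored as a molecule-level property.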
featurize(molecules, log_every_n=1000)[source]

Calculate features for molecules.

Parameters:molecules (RDKit Mol / SMILES string / iterable) – RDKit Mol, or SMILES string, or iterable sequence of RDKit mols/SMILES strings.
Returns:A numpy array containing a featurized representation of datapoints.

WeaveFeaturizer

class deepchem.feat.WeaveFeaturizer(graph_distance=True, explicit_H=False, use_chirality=False)[source]

This class implements the featurization used by Weave convolutions.

Weave convolutions were introduced in [1]. Unlike Duvenaud graph convolutions, weave convolutions require a quadratic matrix of interaction descriptors for each pair of atoms. These extra descriptors may provide additional descriptive power, but at the cost of a larger featurized dataset.

References

[1]Kearnes, Steven, et al. “Molecular graph convolutions: moving beyond fingerprints.” Journal of computer-aided molecular design 30.8 (2016): 595-608.

Note

This class requires RDKit to be installed.

__init__(graph_distance=True, explicit_H=False, use_chirality=False)[source]
Parameters:
  • graph_distance (bool, optional) – If true, use graph distance. Otherwise, use Euclidean distance.
  • explicit_H (bool, optional) – If true, model hydrogens in the molecule.
  • use_chirality (bool, optional) – If true, use chiral information in the featurization
featurize(molecules, log_every_n=1000)[source]

Calculate features for molecules.

Parameters:molecules (RDKit Mol / SMILES string / iterable) – RDKit Mol, or SMILES string, or iterable sequence of RDKit mols/SMILES strings.
Returns:A numpy array containing a featurized representation of datapoints.

CircularFingerprint

class deepchem.feat.CircularFingerprint(radius=2, size=2048, chiral=False, bonds=True, features=False, sparse=False, smiles=False)[source]

Circular (Morgan) fingerprints.

Extended Connectivity Circular Fingerprints compute a bag-of-words style representation of a molecule by breaking it into local neighborhoods and hashing into a bit vector of the specified size. See [1] for more details.

Parameters:
  • radius (int, optional (default 2)) – Fingerprint radius.
  • size (int, optional (default 2048)) – Length of generated bit vector.
  • chiral (bool, optional (default False)) – Whether to consider chirality in fingerprint generation.
  • bonds (bool, optional (default True)) – Whether to consider bond order in fingerprint generation.
  • features (bool, optional (default False)) – Whether to use feature information instead of atom information; see RDKit docs for more info.
  • sparse (bool, optional (default False)) – Whether to return a dict for each molecule containing the sparse fingerprint.
  • smiles (bool, optional (default False)) – Whether to calculate SMILES strings for fragment IDs (only applicable when calculating sparse fingerprints).
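
The "hashing into a bit vector of the specified size" step can be illustrated in isolation. The sketch below assumes the integer identifiers of the local neighborhoods have already been computed (in the real featurizer that work is done by RDKit) and folds them into a fixed-length vector with a modulo operation. The function name and the identifiers are made up for illustration.

```python
def fold_identifiers(identifiers, size=2048):
    # Set one bit per identifier; distinct identifiers may share a bit.
    bits = [0] * size
    for identifier in identifiers:
        bits[identifier % size] = 1
    return bits


example_ids = [3, 11, 20]  # made-up neighborhood identifiers
print(fold_identifiers(example_ids, size=8))
# [0, 0, 0, 1, 1, 0, 0, 0]
```

Identifiers 3 and 11 land on the same bit here, which shows the trade-off behind the size parameter: a shorter vector is cheaper to store but produces more collisions.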

References

[1]Rogers, David, and Mathew Hahn. “Extended-connectivity fingerprints.” Journal of chemical information and modeling 50.5 (2010): 742-754.

Note

This class requires RDKit to be installed.

__init__(radius=2, size=2048, chiral=False, bonds=True, features=False, sparse=False, smiles=False)[source]

Initialize self. See help(type(self)) for accurate signature.

featurize(molecules, log_every_n=1000)[source]

Calculate features for molecules.

Parameters:molecules (RDKit Mol / SMILES string / iterable) – RDKit Mol, or SMILES string, or iterable sequence of RDKit mols/SMILES strings.
Returns:A numpy array containing a featurized representation of datapoints.

RDKitDescriptors

class deepchem.feat.RDKitDescriptors[source]

RDKit descriptors.

This class computes a list of chemical descriptors using RDKit.

See http://rdkit.org/docs/GettingStartedInPython.html#list-of-available-descriptors.

descriptors

1D array of RDKit descriptor names used in this class.

Type:np.ndarray

Note

This class requires RDKit to be installed.

__init__()[source]

Initialize self. See help(type(self)) for accurate signature.

featurize(molecules, log_every_n=1000)[source]

Calculate features for molecules.

Parameters:molecules (RDKit Mol / SMILES string / iterable) – RDKit Mol, or SMILES string, or iterable sequence of RDKit mols/SMILES strings.
Returns:A numpy array containing a featurized representation of datapoints.

CoulombMatrix

class deepchem.feat.CoulombMatrix(max_atoms, remove_hydrogens=False, randomize=False, upper_tri=False, n_samples=1, seed=None)[source]

Calculate Coulomb matrices for molecules.

Coulomb matrices provide a representation of the electronic structure of a molecule. This method is described in [1].

Parameters:
  • max_atoms (int) – Maximum number of atoms for any molecule in the dataset. Used to pad the Coulomb matrix.
  • remove_hydrogens (bool, optional (default False)) – Whether to remove hydrogens before constructing Coulomb matrix.
  • randomize (bool, optional (default False)) – Whether to randomize Coulomb matrices to remove dependence on atom index order.
  • upper_tri (bool, optional (default False)) – Whether to return the upper triangular portion of the Coulomb matrix.
  • n_samples (int, optional (default 1)) – Number of random Coulomb matrices to generate if randomize is True.
  • seed (int, optional) – Random seed.

Example

>>> featurizers = dc.feat.CoulombMatrix(max_atoms=23)
>>> input_file = 'deepchem/feat/tests/data/water.sdf' # really backed by water.sdf.csv
>>> tasks = ["atomization_energy"]
>>> loader = dc.data.SDFLoader(tasks, featurizer=featurizers)
>>> dataset = loader.create_dataset(input_file) #doctest: +ELLIPSIS
Reading structures from deepchem/feat/tests/data/water.sdf.

References

[1]Montavon, Grégoire, et al. “Learning invariant representations of molecules for atomization energy prediction.” Advances in neural information processing systems. 2012.

Note

This class requires RDKit to be installed.

__init__(max_atoms, remove_hydrogens=False, randomize=False, upper_tri=False, n_samples=1, seed=None)[source]

Initialize this featurizer.

Parameters:
  • max_atoms (int) – The maximum number of atoms expected for molecules this featurizer will process.
  • remove_hydrogens (bool, optional (default False)) – If True, remove hydrogens before processing them.
  • randomize (bool, optional (default False)) – If True, use method randomize_coulomb_matrix to randomize Coulomb matrices.
  • upper_tri (bool, optional (default False)) – Generate only upper triangle part of Coulomb matrices.
  • n_samples (int, optional (default 1)) – If randomize is set to True, the number of random samples to draw.
  • seed (int, optional (default None)) – Random seed to use.
coulomb_matrix(mol)[source]

Generate Coulomb matrices for each conformer of the given molecule.

Parameters:mol (RDKit Mol) – Molecule.
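
For orientation, the Coulomb matrix as commonly defined in the literature has diagonal entries 0.5 * Z_i ** 2.4 and off-diagonal entries Z_i * Z_j / |R_i - R_j|, where Z is the nuclear charge and R the atomic position. The sketch below applies that definition to plain lists and zero-pads the result to max_atoms. It performs no unit conversion and does not use RDKit conformers, so it is an illustration of the formula and not the method's actual code; the function name is hypothetical.

```python
import math


def coulomb_matrix_sketch(charges, coords, max_atoms):
    n = len(charges)
    if n > max_atoms:
        raise ValueError("molecule has more atoms than max_atoms")
    # Start from a zero matrix so unused rows and columns act as padding.
    matrix = [[0.0] * max_atoms for _ in range(max_atoms)]
    for i in range(n):
        for j in range(n):
            if i == j:
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                distance = math.dist(coords[i], coords[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


# Two unit charges one distance unit apart, padded to 3 atoms.
m = coulomb_matrix_sketch([1, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)], max_atoms=3)
print(m)  # [[0.5, 1.0, 0.0], [1.0, 0.5, 0.0], [0.0, 0.0, 0.0]]
```

The padding is what lets molecules of different sizes share one fixed feature shape, which is the role of the max_atoms parameter.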
featurize(molecules, log_every_n=1000)[source]

Calculate features for molecules.

Parameters:molecules (RDKit Mol / SMILES string / iterable) – RDKit Mol, or SMILES string, or iterable sequence of RDKit mols/SMILES strings.
Returns:A numpy array containing a featurized representation of datapoints.
static get_interatomic_distances(conf)[source]

Get interatomic distances for atoms in a molecular conformer.

Parameters:conf (RDKit Conformer) – Molecule conformer.
randomize_coulomb_matrix(m)[source]

Randomize a Coulomb matrix as described in [1]:

  1. Compute row norms for M in a vector row_norms.
  2. Sample a zero-mean unit-variance noise vector e with dimension equal to row_norms.
  3. Permute the rows and columns of M with the permutation that sorts row_norms + e.
Parameters:
  • m (ndarray) – Coulomb matrix.
  • n_samples (int, optional (default 1)) – Number of random matrices to generate.
  • seed (int, optional) – Random seed.
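
The three steps listed above can be sketched in plain Python as follows. The sketch draws Gaussian noise from a caller-supplied random.Random instance and sorts in ascending order; the sort direction and the function name are assumptions, since the text only says "the permutation that sorts row_norms + e".

```python
import math
import random


def randomize_coulomb_matrix_sketch(m, rng):
    n = len(m)
    # Step 1: Euclidean norm of every row.
    row_norms = [math.sqrt(sum(v * v for v in row)) for row in m]
    # Step 2: zero-mean, unit-variance Gaussian noise, one value per row.
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Step 3: permutation that sorts the noisy norms (ascending here).
    perm = sorted(range(n), key=lambda i: row_norms[i] + noise[i])
    # Apply the same permutation to both rows and columns.
    return [[m[i][j] for j in perm] for i in perm]


m = [[0.5, 1.0, 0.0], [1.0, 0.5, 0.0], [0.0, 0.0, 0.0]]
shuffled = randomize_coulomb_matrix_sketch(m, random.Random(0))
```

Because rows and columns are permuted together, the result describes the same molecule with its atoms renumbered: it stays symmetric and contains exactly the same entries.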

References

[1]Montavon et al., New Journal of Physics, 15, (2013), 095003

CoulombMatrixEig

class deepchem.feat.CoulombMatrixEig(max_atoms, remove_hydrogens=False, randomize=False, n_samples=1, seed=None)[source]

Calculate the eigenvalues of Coulomb matrices for molecules.

This featurizer computes the eigenvalues of the Coulomb matrices for provided molecules. Coulomb matrices are described in [1].

Parameters:
  • max_atoms (int) – Maximum number of atoms for any molecule in the dataset. Used to pad the Coulomb matrix.
  • remove_hydrogens (bool, optional (default False)) – Whether to remove hydrogens before constructing Coulomb matrix.
  • randomize (bool, optional (default False)) – Whether to randomize Coulomb matrices to remove dependence on atom index order.
  • n_samples (int, optional (default 1)) – Number of random Coulomb matrices to generate if randomize is True.
  • seed (int, optional) – Random seed.

Example

>>> featurizers = dc.feat.CoulombMatrixEig(max_atoms=23)
>>> input_file = 'deepchem/feat/tests/data/water.sdf' # really backed by water.sdf.csv
>>> tasks = ["atomization_energy"]
>>> loader = dc.data.SDFLoader(tasks, featurizer=featurizers)
>>> dataset = loader.create_dataset(input_file) #doctest: +ELLIPSIS
Reading structures from deepchem/feat/tests/data/water.sdf.

References

[1]Montavon, Grégoire, et al. “Learning invariant representations of molecules for atomization energy prediction.” Advances in neural information processing systems. 2012.
__init__(max_atoms, remove_hydrogens=False, randomize=False, n_samples=1, seed=None)[source]

Initialize this featurizer.

Parameters:
  • max_atoms (int) – The maximum number of atoms expected for molecules this featurizer will process.
  • remove_hydrogens (bool, optional (default False)) – If True, remove hydrogens before processing them.
  • randomize (bool, optional (default False)) – If True, use method randomize_coulomb_matrix to randomize Coulomb matrices.
  • n_samples (int, optional (default 1)) – If randomize is set to True, the number of random samples to draw.
  • seed (int, optional (default None)) – Random seed to use.
coulomb_matrix(mol)[source]

Generate Coulomb matrices for each conformer of the given molecule.

Parameters:mol (RDKit Mol) – Molecule.
featurize(molecules, log_every_n=1000)[source]

Calculate features for molecules.

Parameters:molecules (RDKit Mol / SMILES string / iterable) – RDKit Mol, or SMILES string, or iterable sequence of RDKit mols/SMILES strings.
Returns:A numpy array containing a featurized representation of datapoints.
static get_interatomic_distances(conf)[source]

Get interatomic distances for atoms in a molecular conformer.

Parameters:conf (RDKit Conformer) – Molecule conformer.
randomize_coulomb_matrix(m)[source]

Randomize a Coulomb matrix as described in [1]:

  1. Compute row norms for M in a vector row_norms.
  2. Sample a zero-mean unit-variance noise vector e with dimension equal to row_norms.
  3. Permute the rows and columns of M with the permutation that sorts row_norms + e.
Parameters:
  • m (ndarray) – Coulomb matrix.
  • n_samples (int, optional (default 1)) – Number of random matrices to generate.
  • seed (int, optional) – Random seed.

References

[1]Montavon et al., New Journal of Physics, 15, (2013), 095003

AtomicCoordinates

class deepchem.feat.AtomicCoordinates[source]

Nx3 matrix of Cartesian coordinates [Angstrom]

__init__

Initialize self. See help(type(self)) for accurate signature.

featurize(datapoints, log_every_n=1000)[source]

Calculate features for datapoints.

Parameters:datapoints (iterable) – A sequence of objects that you’d like to featurize. Subclasses of Featurizer should implement the _featurize method that featurizes objects in the sequence.
Returns:A numpy array containing a featurized representation of datapoints.
Return type:np.ndarray

AdjacencyFingerprint

class deepchem.feat.AdjacencyFingerprint(n_atom_types=23, max_n_atoms=200, add_hydrogens=False, max_valence=4, num_atoms_feature=False)[source]
__init__(n_atom_types=23, max_n_atoms=200, add_hydrogens=False, max_valence=4, num_atoms_feature=False)[source]

Initialize self. See help(type(self)) for accurate signature.

featurize(rdkit_mols)[source]

Calculate features for datapoints.

Parameters:datapoints (iterable) – A sequence of objects that you’d like to featurize. Subclasses of Featurizer should implement the _featurize method that featurizes objects in the sequence.
Returns:A numpy array containing a featurized representation of datapoints.
Return type:np.ndarray

SmilesToSeq

class deepchem.feat.SmilesToSeq(char_to_idx, max_len=250, pad_len=10, **kwargs)[source]

SmilesToSeq Featurizer takes a SMILES string and turns it into a sequence. Details taken from [1].

SMILES strings smaller than a specified max length (max_len) are padded using the PAD token while those larger than the max length are not considered. Based on the paper, there is also the option to add extra padding (pad_len) on both sides of the string after length normalization. Using a character to index (char_to_idx) mapping, the SMILES characters are turned into indices and the resulting sequence of indices serves as the input for an embedding layer.
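
The padding and index-mapping steps described above can be sketched as follows. This is a simplified illustration: it treats every character as one token (so a two-letter element such as "Cl" would become two tokens), uses a made-up padding token name, and returns None for strings longer than max_len. None of these details are confirmed by the text, and the function name is hypothetical.

```python
PAD = "<pad>"  # hypothetical padding token name


def smiles_to_indices(smiles, char_to_idx, max_len=250, pad_len=10):
    # One token per character; a simplifying assumption.
    tokens = list(smiles)
    if len(tokens) > max_len:
        return None  # strings longer than max_len are not considered
    # Length normalization: pad on the right up to max_len.
    tokens += [PAD] * (max_len - len(tokens))
    # Extra padding on both sides after length normalization.
    tokens = [PAD] * pad_len + tokens + [PAD] * pad_len
    return [char_to_idx[t] for t in tokens]


vocab = {PAD: 0, "C": 1, "O": 2}
print(smiles_to_indices("CCO", vocab, max_len=5, pad_len=1))
# [0, 1, 1, 2, 0, 0, 0]
```

Every accepted string ends up with length max_len + 2 * pad_len, which gives the embedding layer a fixed input size.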

References

[1]Goh, Garrett B., et al. “Using rule-based labels for weak supervised learning: a ChemNet for transferable chemical property prediction.” Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.

Note

This class requires RDKit to be installed.

__init__(char_to_idx, max_len=250, pad_len=10, **kwargs)[source]

Initialize this class.

Parameters:
  • char_to_idx (dict) – Dictionary containing character to index mappings for unique characters
  • max_len (int, default 250) – Maximum allowed length of the SMILES string
  • pad_len (int, default 10) – Amount of padding to add on either side of the SMILES seq
featurize(molecules, log_every_n=1000)[source]

Calculate features for molecules.

Parameters:molecules (RDKit Mol / SMILES string / iterable) – RDKit Mol, or SMILES string, or iterable sequence of RDKit mols/SMILES strings.
Returns:A numpy array containing a featurized representation of datapoints.
remove_pad(characters)[source]

Removes PAD_TOKEN from the character list.

smiles_from_seq(seq)[source]

Reconstructs SMILES string from sequence.

to_seq(smile)[source]

Turns list of smiles characters into array of indices

SmilesToImage

class deepchem.feat.SmilesToImage(img_size=80, res=0.5, max_len=250, img_spec='std', **kwargs)[source]

Convert Smiles string to an image.

SmilesToImage Featurizer takes a SMILES string and turns it into an image. Details taken from [1].

The default size of the image is 80 x 80. Two image modes are currently supported - std & engd. std is the grayscale specification, with atomic numbers as pixel values for atom positions and a constant value of 2 for bond positions. engd is a 4-channel specification, which uses atom properties like hybridization, valency, and charges in addition to atomic number. Bond type is also used for the bonds.

The coordinates of all atoms are computed, and lines are drawn between atoms to indicate bonds. For the respective channels, the atom and bond positions are set to the property values as mentioned in the paper.

References

[1]Goh, Garrett B., et al. “Using rule-based labels for weak supervised learning: a ChemNet for transferable chemical property prediction.” Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.

Note

This class requires RDKit to be installed.

__init__(img_size=80, res=0.5, max_len=250, img_spec='std', **kwargs)[source]
Parameters:
  • img_size (int, default 80) – Size of the image tensor
  • res (float, default 0.5) – Resolution of each pixel in Angstrom
  • max_len (int, default 250) – Maximum allowed length of SMILES string
  • img_spec (str, default std) – Indicates the channel organization of the image tensor
featurize(molecules, log_every_n=1000)[source]

Calculate features for molecules.

Parameters:molecules (RDKit Mol / SMILES string / iterable) – RDKit Mol, or SMILES string, or iterable sequence of RDKit mols/SMILES strings.
Returns:A numpy array containing a featurized representation of datapoints.

ComplexFeaturizer

The dc.feat.ComplexFeaturizer class is the abstract parent class for all featurizers that work with three dimensional molecular complexes.

class deepchem.feat.ComplexFeaturizer[source]

Abstract class for calculating features for mol/protein complexes.

__init__

Initialize self. See help(type(self)) for accurate signature.

featurize_complexes(mol_files, protein_pdbs)[source]

Calculate features for mol/protein complexes.

Parameters:
  • mol_files (list) – List of PDB filenames for molecules.
  • protein_pdbs (list) – List of PDB filenames for proteins.
Returns:

  • features (np.array) – Array of features
  • failures (list) – Indices of complexes that failed to featurize.

RdkitGridFeaturizer

class deepchem.feat.RdkitGridFeaturizer(nb_rotations=0, feature_types=None, ecfp_degree=2, ecfp_power=3, splif_power=3, box_width=16.0, voxel_width=1.0, flatten=False, verbose=True, sanitize=False, **kwargs)[source]

Featurizes protein-ligand complex using flat features or a 3D grid (in which each voxel is described with a vector of features).
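
To make the 3D grid concrete, the sketch below maps a coordinate to a voxel index for a cubic box of side box_width, centered on the ligand centroid and divided into cubes of side voxel_width. The parameter names come from the signature above; the binning rule itself (floor of the shifted coordinate, points outside the box dropped) is an assumption, and voxel_index is a hypothetical helper.

```python
import math


def voxel_index(coord, centroid, box_width=16.0, voxel_width=1.0):
    # Shift so the box's lower corner is the origin, then bin by voxel size.
    index = tuple(
        math.floor((c - c0 + box_width / 2) / voxel_width)
        for c, c0 in zip(coord, centroid)
    )
    n_voxels = int(round(box_width / voxel_width))
    # Points outside the box map to no voxel.
    if any(i < 0 or i >= n_voxels for i in index):
        return None
    return index


print(voxel_index((0.5, -0.5, 7.9), (0.0, 0.0, 0.0)))  # (8, 7, 15)
print(voxel_index((9.0, 0.0, 0.0), (0.0, 0.0, 0.0)))   # None
```

With the default values this gives a 16 x 16 x 16 grid, and each voxel would then hold its own vector of features.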

__init__(nb_rotations=0, feature_types=None, ecfp_degree=2, ecfp_power=3, splif_power=3, box_width=16.0, voxel_width=1.0, flatten=False, verbose=True, sanitize=False, **kwargs)[source]
Parameters:
  • nb_rotations (int, optional (default 0)) – Number of additional random rotations of a complex to generate.
  • feature_types (list, optional (default ['ecfp'])) –
    Types of features to calculate. Available types are:
    flat features -> ‘ecfp_ligand’, ‘ecfp_hashed’, ‘splif_hashed’, ‘hbond_count’
    voxel features -> ‘ecfp’, ‘splif’, ‘sybyl’, ‘salt_bridge’, ‘charge’, ‘hbond’, ‘pi_stack’, ‘cation_pi’
    There are also 3 predefined sets of features: ‘flat_combined’, ‘voxel_combined’, and ‘all_combined’.

    Calculated features are concatenated and their order is preserved (features in predefined sets are in alphabetical order).

  • ecfp_degree (int, optional (default 2)) – ECFP radius.
  • ecfp_power (int, optional (default 3)) – Number of bits to store ECFP features (resulting vector will be 2^ecfp_power long)
  • splif_power (int, optional (default 3)) – Number of bits to store SPLIF features (resulting vector will be 2^splif_power long)
  • box_width (float, optional (default 16.0)) – Size of a box in which voxel features are calculated. Box is centered on a ligand centroid.
  • voxel_width (float, optional (default 1.0)) – Size of a 3D voxel in a grid.
  • flatten (bool, optional (default False)) – Indicate whether calculated features should be flattened. Output is always flattened if flat features are specified in feature_types.
  • verbose (bool, optional (default True)) – Verbosity for logging.
  • sanitize (bool, optional (default False)) – If set to True, molecules will be sanitized. Note that calculating some features (e.g. aromatic interactions) requires sanitized molecules.
  • **kwargs (dict, optional) – Keyword arguments can be used to specify custom cutoffs and bins (see default values below).
    Default cutoffs and bins:
      • hbond_dist_bins – [(2.2, 2.5), (2.5, 3.2), (3.2, 4.0)]
      • hbond_angle_cutoffs – [5, 50, 90]
      • splif_contact_bins – [(0, 2.0), (2.0, 3.0), (3.0, 4.5)]
      • ecfp_cutoff – 4.5
      • sybyl_cutoff – 7.0
      • salt_bridges_cutoff – 5.0
      • pi_stack_dist_cutoff – 4.4
      • pi_stack_angle_cutoff – 30.0
      • cation_pi_dist_cutoff – 6.5
      • cation_pi_angle_cutoff – 30.0
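
To make the box_width, voxel_width, and power parameters concrete, here is a minimal sketch of how an atom could be assigned to a voxel in a box centered on the ligand centroid, and of how long a hashed feature vector is. The helper names voxel_index and hashed_vector_length are hypothetical and are not part of DeepChem; the exact indexing convention used by RdkitGridFeaturizer may differ.

```python
import math


def voxel_index(atom_xyz, centroid_xyz, box_width=16.0, voxel_width=1.0):
    """Return the (i, j, k) voxel containing an atom, or None if it is outside the box.

    Hypothetical helper: assumes a cubic box centered on the ligand centroid.
    """
    n_voxels = int(round(box_width / voxel_width))  # voxels along each axis
    half_width = box_width / 2.0
    index = []
    for atom, center in zip(atom_xyz, centroid_xyz):
        # Shift coordinates so the box spans [0, box_width) along this axis.
        offset = atom - center + half_width
        i = math.floor(offset / voxel_width)
        if i < 0 or i >= n_voxels:
            return None  # atom falls outside the box and is ignored
        index.append(i)
    return tuple(index)


def hashed_vector_length(power):
    """Length of a hashed fingerprint stored with `power` bits (2^power)."""
    return 2 ** power


inside = voxel_index((0.5, 0.5, 0.5), (0.0, 0.0, 0.0))
outside = voxel_index((9.0, 0.0, 0.0), (0.0, 0.0, 0.0))
ecfp_length = hashed_vector_length(3)
```

With the defaults, the grid has 16 voxels per axis and each hashed ECFP or SPLIF vector has 2^3 = 8 entries, so increasing ecfp_power by one doubles the per-voxel feature length.
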
featurize_complexes(mol_files, protein_pdbs)[source]

Calculate features for mol/protein complexes.

Parameters:
  • mol_files (list) – List of PDB filenames for molecules.
  • protein_pdbs (list) – List of PDB filenames for proteins.
Returns:

  • features (np.array) – Array of features
  • failures (list) – Indices of complexes that failed to featurize.

AtomConvFeaturizer

class deepchem.feat.NeighborListComplexAtomicCoordinates(max_num_neighbors=None, neighbor_cutoff=4)[source]

Adjacency list of neighbors for protein-ligand complexes in 3-space.

Neighbors are determined by a user-defined distance cutoff.
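
If the idea of a cutoff-based neighbor list is new to you, the following sketch may help. It is a brute-force illustration and not DeepChem's implementation: the function name neighbor_list is hypothetical, and treating the cutoff as a strict inequality and keeping the closest atoms when max_num_neighbors is set are assumptions.

```python
import math


def neighbor_list(coords, neighbor_cutoff=4.0, max_num_neighbors=None):
    """Map each atom index to the indices of atoms within the cutoff distance.

    Hypothetical helper; neighbors are ordered from nearest to farthest.
    """
    neighbors = {}
    for i, r_i in enumerate(coords):
        within = []
        for j, r_j in enumerate(coords):
            if i == j:
                continue  # an atom is not its own neighbor
            distance = math.dist(r_i, r_j)
            if distance < neighbor_cutoff:
                within.append((distance, j))
        within.sort()  # nearest neighbors first
        if max_num_neighbors is not None:
            within = within[:max_num_neighbors]
        neighbors[i] = [j for _, j in within]
    return neighbors


coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
full = neighbor_list(coords)
capped = neighbor_list(coords, max_num_neighbors=1)
```

The last atom is more than 4 angstroms from every other atom, so its neighbor list is empty.
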

__init__(max_num_neighbors=None, neighbor_cutoff=4)[source]

Initialize self. See help(type(self)) for accurate signature.

featurize_complexes(mol_files, protein_pdbs)[source]

Calculate features for mol/protein complexes.

Parameters:
  • mol_files (list) – List of PDB filenames for molecules.
  • protein_pdbs (list) – List of PDB filenames for proteins.
Returns:

  • features (np.array) – Array of features
  • failures (list) – Indices of complexes that failed to featurize.

MaterialsFeaturizers

Materials Featurizers are those that work with datasets of inorganic crystals. These featurizers operate on chemical compositions (e.g. “MoS2”), or on a lattice and 3D coordinates that specify a periodic crystal structure. They should be applied on systems that have periodic boundary conditions. Materials featurizers are not designed to work with molecules.

ElementPropertyFingerprint

class deepchem.feat.ElementPropertyFingerprint(data_source='matminer')[source]

Fingerprint of elemental properties from composition.

Based on the data source chosen, returns properties and statistics (min, max, range, mean, standard deviation, mode) for a compound based on elemental stoichiometry. E.g., the average electronegativity of atoms in a crystal structure. The chemical fingerprint is a vector of these statistics. For a full list of properties and statistics, see matminer.featurizers.composition.ElementProperty(data_source).feature_labels().

This featurizer requires the optional dependencies pymatgen and matminer. It may be useful when only crystal compositions are available (and not 3D coordinates).
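
As a rough illustration of what "statistics based on elemental stoichiometry" means, the sketch below computes a few statistics of Pauling electronegativity for MoS2. It does not call matminer; the function property_statistics and the two-element lookup table are assumptions made for this example, and the real featurizer covers many more properties and statistics.

```python
# Pauling electronegativities for the two elements used in this example.
ELECTRONEGATIVITY = {"Mo": 2.16, "S": 2.58}


def property_statistics(composition, table):
    """Compute simple statistics of one elemental property for a composition.

    `composition` maps element symbols to stoichiometric amounts.
    Hypothetical helper, not part of DeepChem or matminer.
    """
    total_atoms = sum(composition.values())
    values = [table[element] for element in composition]
    # The mean is weighted by how many atoms of each element are present.
    mean = sum(table[el] * n for el, n in composition.items()) / total_atoms
    return {
        "minimum": min(values),
        "maximum": max(values),
        "range": max(values) - min(values),
        "mean": mean,
    }


stats = property_statistics({"Mo": 1, "S": 2}, ELECTRONEGATIVITY)
```

Here the mean is (2.16 + 2 × 2.58) / 3 = 2.44, which is the "average electronegativity of atoms in a crystal structure" mentioned above.
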

See references [1], [2], [3], and [4] for more details.

References

[1] MagPie data: Ward, L. et al. npj Comput Mater 2, 16028 (2016). https://doi.org/10.1038/npjcompumats.2016.28
[2] Deml data: Deml, A. et al. Physical Review B 93, 085142 (2016). https://doi.org/10.1103/PhysRevB.93.085142
[3] Matminer: Ward, L. et al. Comput. Mater. Sci. 152, 60-69 (2018).
[4] Pymatgen: Ong, S.P. et al. Comput. Mater. Sci. 68, 314-319 (2013).
__init__(data_source='matminer')[source]
Parameters:data_source ({"matminer", "magpie", "deml"}) – Source for element property data.
featurize(datapoints, log_every_n=1000)[source]

Calculate features for datapoints.

Parameters:datapoints (iterable) – A sequence of objects that you’d like to featurize. Subclasses of Featurizer should implement the _featurize method that featurizes objects in the sequence.
Returns:A numpy array containing a featurized representation of datapoints.

SineCoulombMatrix

class deepchem.feat.SineCoulombMatrix(max_atoms, flatten=True)[source]

Calculate sine Coulomb matrix for crystals.

A variant of Coulomb matrix for periodic crystals.

The sine Coulomb matrix is identical to the Coulomb matrix, except that the inverse distance function is replaced by the inverse of sin**2 of the vector between sites which are periodic in the dimensions of the crystal lattice.

Features are flattened into a vector of matrix eigenvalues by default for ML-readiness. To ensure that all feature vectors are equal length, the maximum number of atoms (eigenvalues) in the input dataset must be specified.

This featurizer requires the optional dependencies pymatgen and matminer. It may be useful when crystal structures with 3D coordinates are available.
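
To see why the sine form respects periodicity, consider the sketch below. It evaluates matrix elements for the special case of an orthorhombic cell (lattice vectors along x, y, z), following the form given by Faber et al.: the off-diagonal element is Z_i Z_j divided by the norm of the lattice-scaled sin² of the fractional separation, and the diagonal is 0.5 Z^2.4. The function names are hypothetical, unit conversions are omitted, and this is not the matminer or DeepChem implementation.

```python
import math


def sine_coulomb_element(z_i, z_j, frac_i, frac_j, lattice_lengths):
    """Off-diagonal sine Coulomb element for two distinct sites.

    Assumes an orthorhombic cell with edge lengths `lattice_lengths`
    and site positions given in fractional coordinates.
    """
    components = [
        length * math.sin(math.pi * (f_i - f_j)) ** 2
        for f_i, f_j, length in zip(frac_i, frac_j, lattice_lengths)
    ]
    sine_distance = math.sqrt(sum(c * c for c in components))
    return z_i * z_j / sine_distance


def sine_coulomb_diagonal(z):
    """Diagonal element, the same as in the ordinary Coulomb matrix."""
    return 0.5 * z ** 2.4


cell = (4.0, 4.0, 4.0)
element = sine_coulomb_element(11, 17, (0.0, 0.0, 0.0), (0.5, 0.0, 0.0), cell)
# Moving the second site by one full lattice vector leaves the element unchanged.
shifted = sine_coulomb_element(11, 17, (0.0, 0.0, 0.0), (1.5, 0.0, 0.0), cell)
```

Because sin² has period 1 in fractional coordinates, a site and any of its periodic images give the same matrix element, which is not true of the plain inverse distance.
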

See [1] for more details.

References

[1] Faber et al. Inter. J. Quantum Chem. 115, 16, 2015.
__init__(max_atoms, flatten=True)[source]
Parameters:
  • max_atoms (int) – Maximum number of atoms for any crystal in the dataset. Used to pad the Coulomb matrix.
  • flatten (bool (default True)) – Return flattened vector of matrix eigenvalues.
featurize(datapoints, log_every_n=1000)[source]

Calculate features for datapoints.

Parameters:datapoints (iterable) – A sequence of objects that you’d like to featurize. Subclasses of Featurizer should implement the _featurize method that featurizes objects in the sequence.
Returns:A numpy array containing a featurized representation of datapoints.

StructureGraphFeaturizer

class deepchem.feat.StructureGraphFeaturizer(radius=8.0, max_neighbors=12, step=0.2)[source]

Calculate structure graph features for crystals.

Based on the implementation in Crystal Graph Convolutional Neural Networks (CGCNN). The method constructs a crystal graph representation including atom features (atomic numbers) and bond features (neighbor distances). Neighbors are determined by searching in a sphere around atoms in the unit cell. A Gaussian filter is applied to neighbor distances. All units are in angstrom.

This featurizer requires the optional dependency pymatgen. It may be useful when 3D coordinates are available and when using graph network models and crystal graph convolutional networks.
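
The "Gaussian filter" step can be pictured as expanding each neighbor distance into a vector of Gaussian responses on a grid of centers spaced step apart, from 0 up to radius. The sketch below shows one way to do this; the function name gaussian_expand is hypothetical, and using step as the Gaussian width is an assumption rather than a confirmed detail of StructureGraphFeaturizer.

```python
import math


def gaussian_expand(distance, radius=8.0, step=0.2):
    """Expand one distance (in angstrom) into Gaussian basis features.

    Hypothetical helper: centers run from 0 to `radius` in increments of `step`.
    """
    n_centers = int(round(radius / step)) + 1
    centers = [k * step for k in range(n_centers)]
    return [math.exp(-((distance - c) ** 2) / step ** 2) for c in centers]


features = gaussian_expand(2.0)
```

With the defaults, each bond is described by 41 numbers, and a smaller step gives a finer (and longer) distance encoding.
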

See [1] for more details.

References

[1] T. Xie and J. C. Grossman, Phys. Rev. Lett. 120, 2018.
__init__(radius=8.0, max_neighbors=12, step=0.2)[source]
Parameters:
  • radius (float (default 8.0)) – Radius of sphere for finding neighbors of atoms in unit cell.
  • max_neighbors (int (default 12)) – Maximum number of neighbors to consider when constructing graph.
  • step (float (default 0.2)) – Step size for Gaussian filter.
featurize(datapoints, log_every_n=1000)[source]

Calculate features for datapoints.

Parameters:datapoints (iterable) – A sequence of objects that you’d like to featurize. Subclasses of Featurizer should implement the _featurize method that featurizes objects in the sequence.
Returns:A numpy array containing a featurized representation of datapoints.

BindingPocketFeaturizer

class deepchem.feat.BindingPocketFeaturizer[source]

Featurizes binding pockets with information about chemical environments.

In many applications, it’s desirable to look at binding pockets on macromolecules which may be good targets for potential ligands or other molecules to interact with. A BindingPocketFeaturizer expects to be given a macromolecule, and a list of pockets to featurize on that macromolecule. These pockets should be of the form produced by a dc.dock.BindingPocketFinder, that is as a list of dc.utils.CoordinateBox objects.

The base featurization in this class is currently very simple: it counts the number of residues of each type present in the pocket. It’s likely that you’ll want to override this implementation for more sophisticated downstream use cases. Note that this class’s implementation will only work for proteins and not for other macromolecules.
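
To illustrate the residue-counting idea, here is a minimal sketch that does not depend on DeepChem or MDTraj. The pocket is represented as a plain tuple of coordinate ranges standing in for a dc.utils.CoordinateBox, each residue is reduced to a name and a single representative coordinate, and the residue list is truncated; all of these are simplifying assumptions.

```python
# Truncated list of residue types, for illustration only.
RESIDUE_TYPES = ["ALA", "GLY", "SER"]


def count_residues(residues, box):
    """Count residues of each type whose coordinate lies inside the box.

    `residues` is a list of (name, (x, y, z)) pairs and `box` is
    ((x_min, x_max), (y_min, y_max), (z_min, z_max)). Hypothetical helper.
    """
    counts = [0] * len(RESIDUE_TYPES)
    for name, xyz in residues:
        inside = all(low <= value <= high for value, (low, high) in zip(xyz, box))
        if inside and name in RESIDUE_TYPES:
            counts[RESIDUE_TYPES.index(name)] += 1
    return counts


residues = [
    ("ALA", (1.0, 1.0, 1.0)),
    ("GLY", (2.0, 2.0, 2.0)),
    ("ALA", (9.0, 9.0, 9.0)),  # outside the pocket
    ("ALA", (0.5, 3.0, 3.0)),
]
pocket = ((0.0, 5.0), (0.0, 5.0), (0.0, 5.0))
counts = count_residues(residues, pocket)
```

Stacking one such count vector per pocket gives an array with one row per pocket and one column per residue type.
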

__init__

Initialize self. See help(type(self)) for accurate signature.

featurize(protein_file, pockets)[source]

Calculate atomic coordinates.

Parameters:
  • protein_file (str) – Location of PDB file. Will be loaded by MDTraj.
  • pockets (list[CoordinateBox]) – List of dc.utils.CoordinateBox objects.
Returns:A numpy array of shape (len(pockets), n_residues).

UserDefinedFeaturizer

class deepchem.feat.UserDefinedFeaturizer(feature_fields)[source]

Directs usage of user-computed featurizations.

__init__(feature_fields)[source]

Creates user-defined-featurizer.

featurize(datapoints, log_every_n=1000)[source]

Calculate features for datapoints.

Parameters:datapoints (iterable) – A sequence of objects that you’d like to featurize. Subclasses of Featurizer should implement the _featurize method that featurizes objects in the sequence.
Returns:A numpy array containing a featurized representation of datapoints.

BPSymmetryFunctionInput

class deepchem.feat.BPSymmetryFunctionInput(max_atoms)[source]

Calculate symmetry functions for each atom in the molecules.

This method is described in [1].

References

[1] Behler, Jörg, and Michele Parrinello. “Generalized neural-network representation of high-dimensional potential-energy surfaces.” Physical review letters 98.14 (2007): 146401.

Note

This class requires RDKit to be installed.

__init__(max_atoms)[source]

Initialize this featurizer.

Parameters:max_atoms (int) – The maximum number of atoms expected for molecules this featurizer will process.
featurize(molecules, log_every_n=1000)[source]

Calculate features for molecules.

Parameters:molecules (RDKit Mol / SMILES string /iterable) – RDKit Mol, or SMILES string or iterable sequence of RDKit mols/SMILES strings.
Returns:A numpy array containing a featurized representation of datapoints.

OneHotFeaturizer

class deepchem.feat.OneHotFeaturizer(charset=None, padlength=120)[source]

Encodes a molecule as a one-hot array.

This featurizer takes a molecule and encodes its SMILES string as a one-hot array.

Note

This class requires RDKit to be installed. Note that this featurizer is not thread-safe during initialization of charset.

__init__(charset=None, padlength=120)[source]

Initialize featurizer.

Parameters:
  • charset (list of str, optional (default None)) – A list of strings, where each string is length 1.
  • padlength (int, optional (default 120)) – Length to pad the SMILES strings to.
featurize(molecules, log_every_n=1000)[source]

Calculate features for molecules.

Parameters:molecules (RDKit Mol / SMILES string /iterable) – RDKit Mol, or SMILES string or iterable sequence of RDKit mols/SMILES strings.
Returns:A numpy array containing a featurized representation of datapoints.
one_hot_array(i)[source]

Create a one-hot array with bit i set to 1.

Parameters:i (int) – Bit to set to 1.
Returns:One-hot list of length len(self.charset).
Return type:list of int
one_hot_encoded(smile)[source]

One-hot encode an entire SMILES string.

Parameters:smile (str) – SMILES string to encode.
Returns:One-hot encoded arrays, one for each character in smile.
Return type:np.array
one_hot_index(c)[source]

Compute the one-hot index of a character.

Parameters:c (char) – Character whose index we want.
Returns:Index of c in self.charset.
pad_smile(smile)[source]

Pad a SMILES string to self.pad_length.

Parameters:smile (str) – The SMILES string to be padded.
Returns:SMILES string space-padded to self.pad_length.
Return type:str
untransform(z)[source]

Convert from a one-hot representation back to SMILES.

Parameters:z (list) – List of one-hot encoded features.
Returns:SMILES strings, picking the maximum for each one-hot encoded array.
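
The methods above fit together as a pad, encode, and decode round trip. The following sketch reproduces that flow with standalone functions so you can run it without RDKit or DeepChem. The function names mirror the method names but are hypothetical free functions, and the small CHARSET is made up for this example; the real featurizer's charset and edge-case handling may differ.

```python
# Small illustrative character set; the last entry is the padding character.
CHARSET = ["C", "O", "N", "(", ")", "=", " "]


def pad_smiles(smiles, pad_length):
    """Right-pad a SMILES string with spaces up to pad_length characters."""
    return smiles.ljust(pad_length)


def one_hot_array(i, size):
    """Return a list of length `size` with position i set to 1."""
    return [1 if k == i else 0 for k in range(size)]


def one_hot_encoded(smiles, charset, pad_length):
    """Encode each character of the padded SMILES string as a one-hot row."""
    padded = pad_smiles(smiles, pad_length)
    return [one_hot_array(charset.index(c), len(charset)) for c in padded]


def untransform(encoded, charset):
    """Decode by taking the position of the maximum in each row, then strip padding."""
    chars = [charset[row.index(max(row))] for row in encoded]
    return "".join(chars).strip()


encoded = one_hot_encoded("CC(=O)O", CHARSET, pad_length=10)
decoded = untransform(encoded, CHARSET)
```

The encoded result has one row per padded character and one column per charset entry, so a molecule always maps to an array of shape (padlength, len(charset)).
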

RawFeaturizer

class deepchem.feat.RawFeaturizer(smiles=False)[source]

Encodes a molecule as a SMILES string or RDKit mol.

This featurizer can be useful when you’re trying to transform a large collection of RDKit mol objects into SMILES strings, or alternatively as a “no-op” featurizer in your molecular pipeline.

Note

This class requires RDKit to be installed.

__init__(smiles=False)[source]

Initialize this featurizer.

Parameters:smiles (bool, optional (default False)) – If True, encode this molecule as a SMILES string. Otherwise, encode it as an RDKit mol.
featurize(molecules, log_every_n=1000)[source]

Calculate features for molecules.

Parameters:molecules (RDKit Mol / SMILES string /iterable) – RDKit Mol, or SMILES string or iterable sequence of RDKit mols/SMILES strings.
Returns:A numpy array containing a featurized representation of datapoints.