Featurizers

DeepChem contains an extensive collection of featurizers. If you haven’t run into this terminology before, a “featurizer” is a chunk of code that transforms raw input data into a processed form suitable for machine learning. Machine learning methods often need data to be pre-chewed for them to process. Think of this like a mama penguin chewing up food so the baby penguin can digest it easily.

Now if you’ve watched a few introductory deep learning lectures, you might ask, why do we need something like a featurizer? Isn’t part of the promise of deep learning that we can learn patterns directly from raw data?

Unfortunately, it turns out that deep learning techniques need featurizers just like normal machine learning methods do. Arguably, they are less dependent on sophisticated featurizers and more capable of learning sophisticated patterns from simpler data. Nevertheless, deep learning systems can’t simply chew up raw files. For this reason, DeepChem provides an extensive collection of featurization methods, which we review on this page.

Molecule Featurizers

These featurizers work with datasets of molecules.

Graph Convolution Featurizers

A future version of DeepChem will simplify our graph convolution models around a joint data representation (GraphData). Until then, we provide several featurizers.

ConvMolFeaturizer and WeaveFeaturizer are used with graph convolution models that inherit from KerasModel. ConvMolFeaturizer is used with all of these models except WeaveModel, and WeaveFeaturizer is used only with WeaveModel. MolGraphConvFeaturizer, on the other hand, is used with graph convolution models that inherit from TorchModel. MolGanFeaturizer is used with the MolGAN model, a GAN model for generating small molecules.

ConvMolFeaturizer

class ConvMolFeaturizer(master_atom: bool = False, use_chirality: bool = False, atom_properties: Iterable[str] = [], per_atom_fragmentation: bool = False)[source]

This class implements the featurization used by Duvenaud graph convolutions.

Duvenaud graph convolutions [1]_ construct a vector of descriptors for each atom in a molecule. The featurizer computes that vector of local descriptors.

Examples

>>> import deepchem as dc
>>> smiles = ["C", "CCC"]
>>> featurizer=dc.feat.ConvMolFeaturizer(per_atom_fragmentation=False)
>>> f = featurizer.featurize(smiles)
>>> # Using ConvMolFeaturizer to create featurized fragments derived from molecules of interest.
... # This is used only in the context of performing interpretation of models using atomic
... # contributions (atom-based model interpretation)
... smiles = ["C", "CCC"]
>>> featurizer=dc.feat.ConvMolFeaturizer(per_atom_fragmentation=True)
>>> f = featurizer.featurize(smiles)
>>> len(f) # contains 2 lists with  featurized fragments from 2 mols
2

See also

Detailed

References

1

Duvenaud, David K., et al. “Convolutional networks on graphs for learning molecular fingerprints.” Advances in neural information processing systems. 2015.

Note

This class requires RDKit to be installed.

__init__(master_atom: bool = False, use_chirality: bool = False, atom_properties: Iterable[str] = [], per_atom_fragmentation: bool = False)[source]
Parameters
  • master_atom (Boolean) – If true, create a fake atom with bonds to every other atom. Its features are initialized to the mean of the other atom features in the molecule. This technique is briefly discussed in Neural Message Passing for Quantum Chemistry https://arxiv.org/pdf/1704.01212.pdf

  • use_chirality (Boolean) – If true, make the resulting atom features aware of the chirality of the molecules in question.

  • atom_properties (list of string or None) – Properties in the RDKit Mol object to use as additional atom-level features in the larger molecular feature. If None, then no atom-level properties are used. Properties in the RDKit mol object should be named in the form atom XXXXXXXX NAME, where XXXXXXXX is a zero-padded 8-digit number corresponding to the zero-indexed atom index of each atom and NAME is the name of the property provided in atom_properties. So “atom 00000000 sasa” would be the name of the molecule-level property in mol where the solvent accessible surface area of atom 0 would be stored.

  • per_atom_fragmentation (Boolean) – If True, then multiple “atom-depleted” versions of each molecule will be created (using featurize() method). For each molecule, atoms are removed one at a time and the resulting molecule is featurized. The result is a list of ConvMol objects, one with each heavy atom removed. This is useful for subsequent model interpretation: finding atoms favorable/unfavorable for (modelled) activity. This option is typically used in combination with a FlatteningTransformer to split the lists into separate samples.

Since ConvMol is an object and not a numpy array, the dtype needs to be set to object.
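
The naming convention described for atom_properties can be illustrated with a minimal sketch. The helper name atom_property_key is hypothetical and not part of DeepChem; it only shows how the zero-padded, 8-digit atom index and the property name might be combined into a molecule-level property name.

```python
def atom_property_key(atom_index: int, name: str) -> str:
    """Build a property name of the form 'atom XXXXXXXX NAME'.

    Hypothetical helper for illustration only; not a DeepChem function.
    """
    if atom_index < 0:
        raise ValueError("atom_index must be non-negative")
    # Zero-pad the zero-indexed atom index to 8 digits.
    return f"atom {atom_index:08d} {name}"


# Property names for the first three atoms of a molecule, using the
# "sasa" property name from the description above.
keys = [atom_property_key(i, "sasa") for i in range(3)]
```

Each resulting string is the name under which a per-atom value would be stored on the molecule, with "sasa" then passed in atom_properties.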

featurize(molecules: Union[Any, str, Iterable[Any], Iterable[str]], log_every_n: int = 1000)numpy.ndarray[source]

Overrides the parent method to add handling of atom-depleted molecule featurization.

Parameters
  • molecules (rdkit.Chem.rdchem.Mol / SMILES string / iterable) – RDKit Mol, or SMILES string or iterable sequence of RDKit mols/SMILES strings.

  • log_every_n (int, default 1000) – Logging messages reported every log_every_n samples.

Returns

features – A numpy array containing a featurized representation of datapoints.

Return type

np.ndarray

WeaveFeaturizer

class WeaveFeaturizer(graph_distance: bool = True, explicit_H: bool = False, use_chirality: bool = False, max_pair_distance: Optional[int] = None)[source]

This class implements the featurization used by Weave convolutions.

Weave convolutions were introduced in [1]_. Unlike Duvenaud graph convolutions, weave convolutions require a quadratic matrix of interaction descriptors for each pair of atoms. These extra descriptors may provide for additional descriptive power but at the cost of a larger featurized dataset.

Examples

>>> import deepchem as dc
>>> mols = ["CCC"]
>>> featurizer = dc.feat.WeaveFeaturizer()
>>> features = featurizer.featurize(mols)
>>> type(features[0])
<class 'deepchem.feat.mol_graphs.WeaveMol'>
>>> features[0].get_num_atoms() # 3 atoms in compound
3
>>> features[0].get_num_features() # feature size
75
>>> type(features[0].get_atom_features())
<class 'numpy.ndarray'>
>>> features[0].get_atom_features().shape
(3, 75)
>>> type(features[0].get_pair_features())
<class 'numpy.ndarray'>
>>> features[0].get_pair_features().shape
(9, 14)

References

1

Kearnes, Steven, et al. “Molecular graph convolutions: moving beyond fingerprints.” Journal of computer-aided molecular design 30.8 (2016): 595-608.

Note

This class requires RDKit to be installed.

__init__(graph_distance: bool = True, explicit_H: bool = False, use_chirality: bool = False, max_pair_distance: Optional[int] = None)[source]

Initialize this featurizer with set parameters.

Parameters
  • graph_distance (bool, (default True)) – If True, use graph distance for distance features. Otherwise, use Euclidean distance. Note that Euclidean distance requires atomic coordinates, so molecules passed to this featurizer must have valid conformer information when this option is False.

  • explicit_H (bool, (default False)) – If true, model hydrogens in the molecule.

  • use_chirality (bool, (default False)) – If true, use chiral information in the featurization

  • max_pair_distance (Optional[int], (default None)) – This value can be a positive integer or None. This parameter determines the maximum graph distance at which pair features are computed. For example, if max_pair_distance==2, then pair features are computed only for atoms at most graph distance 2 apart. If max_pair_distance is None, all pairs are considered (effectively infinite max_pair_distance)

featurize(molecules, log_every_n=1000)numpy.ndarray[source]

Calculate features for molecules.

Parameters
  • molecules (rdkit.Chem.rdchem.Mol / SMILES string / iterable) – RDKit Mol, or SMILES string or iterable sequence of RDKit mols/SMILES strings.

  • log_every_n (int, default 1000) – Logging messages reported every log_every_n samples.

Returns

features – A numpy array containing a featurized representation of datapoints.

Return type

np.ndarray

MolGanFeaturizer

class MolGanFeaturizer(max_atom_count: int = 9, kekulize: bool = True, bond_labels: Optional[List[Any]] = None, atom_labels: Optional[List[int]] = None)[source]

Featurizer for MolGAN de-novo molecular generation [1]_. The default representation is a GraphMatrix object, a wrapper for two matrices containing atom and bond type information. The class also provides the reverse operation (defeaturization).

Examples

>>> import deepchem as dc
>>> from rdkit import Chem
>>> rdkit_mol, smiles_mol = Chem.MolFromSmiles('CCC'), 'C1=CC=CC=C1'
>>> molecules = [rdkit_mol, smiles_mol]
>>> featurizer = dc.feat.MolGanFeaturizer()
>>> features = featurizer.featurize(molecules)
>>> len(features) # 2 molecules
2
>>> type(features[0])
<class 'deepchem.feat.molecule_featurizers.molgan_featurizer.GraphMatrix'>
>>> molecules = featurizer.defeaturize(features) # defeaturization
>>> type(molecules[0])
<class 'rdkit.Chem.rdchem.Mol'>
__init__(max_atom_count: int = 9, kekulize: bool = True, bond_labels: Optional[List[Any]] = None, atom_labels: Optional[List[int]] = None)[source]
Parameters
  • max_atom_count (int, default 9) – Maximum number of atoms used for creation of the adjacency matrix. Molecules cannot have more atoms than this number. Implicit hydrogens do not count.

  • kekulize (bool, default True) – Whether molecules should be kekulized. Kekulization solves a number of issues with defeaturization.

  • bond_labels (List[RDKitBond]) – List of types of bond used for generation of adjacency matrix

  • atom_labels (List[int]) – List of atomic numbers used for generation of node features

References

1

Nicola De Cao et al. “MolGAN: An implicit generative model for small molecular graphs” (2018), https://arxiv.org/abs/1805.11973

featurize(molecules, log_every_n=1000)numpy.ndarray[source]

Calculate features for molecules.

Parameters
  • molecules (rdkit.Chem.rdchem.Mol / SMILES string / iterable) – RDKit Mol, or SMILES string or iterable sequence of RDKit mols/SMILES strings.

  • log_every_n (int, default 1000) – Logging messages reported every log_every_n samples.

Returns

features – A numpy array containing a featurized representation of datapoints.

Return type

np.ndarray

defeaturize(graphs: Union[deepchem.feat.molecule_featurizers.molgan_featurizer.GraphMatrix, Sequence[deepchem.feat.molecule_featurizers.molgan_featurizer.GraphMatrix]], log_every_n: int = 1000)numpy.ndarray[source]

Calculates molecules from corresponding GraphMatrix objects.

Parameters
  • graphs (GraphMatrix / iterable) – GraphMatrix object or corresponding iterable

  • log_every_n (int, default 1000) – Logging messages reported every log_every_n samples.

Returns

features – A numpy array containing RDKit Mol objects.

Return type

np.ndarray

MolGraphConvFeaturizer

class MolGraphConvFeaturizer(use_edges: bool = False, use_chirality: bool = False, use_partial_charge: bool = False)[source]

This class is a featurizer of general graph convolution networks for molecules.

The default node (atom) and edge (bond) representations are based on the WeaveNet paper. If you want to use your own representations, you can use this class as a guide to define your own Featurizer. In many cases, it’s enough to modify the return values of construct_atom_feature or construct_bond_feature.

The default node representation is constructed by concatenating the following values, and the feature length is 30.

  • Atom type: A one-hot vector of this atom, “C”, “N”, “O”, “F”, “P”, “S”, “Cl”, “Br”, “I”, “other atoms”.

  • Formal charge: Integer electronic charge.

  • Hybridization: A one-hot vector of “sp”, “sp2”, “sp3”.

  • Hydrogen bonding: A one-hot vector of whether this atom is a hydrogen bond donor or acceptor.

  • Aromatic: A one-hot vector of whether the atom belongs to an aromatic ring.

  • Degree: A one-hot vector of the degree (0-5) of this atom.

  • Number of Hydrogens: A one-hot vector of the number of hydrogens (0-4) connected to this atom.

  • Chirality: A one-hot vector of the chirality, “R” or “S”. (Optional)

  • Partial charge: Calculated partial charge. (Optional)

The default edge representation is constructed by concatenating the following values, and the feature length is 11.

  • Bond type: A one-hot vector of the bond type, “single”, “double”, “triple”, or “aromatic”.

  • Same ring: A one-hot vector of whether the atoms in the pair are in the same ring.

  • Conjugated: A one-hot vector of whether this bond is conjugated or not.

  • Stereo: A one-hot vector of the stereo configuration of a bond.

If you want to know more details about features, please check the paper [1]_ and utilities in deepchem.utils.molecule_feature_utils.py.

Examples

>>> from deepchem.feat import MolGraphConvFeaturizer
>>> smiles = ["C1CCC1", "C1=CC=CN=C1"]
>>> featurizer = MolGraphConvFeaturizer(use_edges=True)
>>> out = featurizer.featurize(smiles)
>>> type(out[0])
<class 'deepchem.feat.graph_data.GraphData'>
>>> out[0].num_node_features
30
>>> out[0].num_edge_features
11

References

1

Kearnes, Steven, et al. “Molecular graph convolutions: moving beyond fingerprints.” Journal of computer-aided molecular design 30.8 (2016):595-608.

Note

This class requires RDKit to be installed.

__init__(use_edges: bool = False, use_chirality: bool = False, use_partial_charge: bool = False)[source]
Parameters
  • use_edges (bool, default False) – Whether to use edge features or not.

  • use_chirality (bool, default False) – Whether to use chirality information or not. If True, featurization becomes slow.

  • use_partial_charge (bool, default False) – Whether to use partial charge data or not. If True, this featurizer computes Gasteiger charges. As a result, featurization may fail for some molecules and becomes slower.

featurize(molecules, log_every_n=1000)numpy.ndarray[source]

Calculate features for molecules.

Parameters
  • molecules (rdkit.Chem.rdchem.Mol / SMILES string / iterable) – RDKit Mol, or SMILES string or iterable sequence of RDKit mols/SMILES strings.

  • log_every_n (int, default 1000) – Logging messages reported every log_every_n samples.

Returns

features – A numpy array containing a featurized representation of datapoints.

Return type

np.ndarray

PagtnMolGraphFeaturizer

class PagtnMolGraphFeaturizer(max_length=5)[source]

This class is a featurizer of PAGTN graph networks for molecules.

The featurization is based on the PAGTN model. It is slightly more computationally intensive than the default graph convolution featurizer, but it builds a molecular graph that connects all atom pairs, accounting for interactions of an atom with every other atom in the molecule. According to the paper, the interaction between a pair of atoms depends on the relative distance between them, and hence the featurizer needs to calculate the shortest path between them.

The default node representation is constructed by concatenating the following values, and the feature length is 94.

  • Atom type: One hot encoding of the atom type. It covers the elements most likely to occur in a chemical compound.

  • Formal charge: One hot encoding of formal charge of the atom.

  • Degree: One hot encoding of the atom degree

  • Explicit Valence: One hot encoding of explicit valence of an atom. The supported possibilities include 0 - 6.

  • Implicit Valence: One hot encoding of implicit valence of an atom. The supported possibilities include 0 - 5.

  • Aromaticity: Boolean representing if an atom is aromatic.

The default edge representation is constructed by concatenating the following values, and the feature length is 42. It builds a complete graph where each node is connected to every other node. The edge representations are calculated based on the shortest path between two nodes (choose any one if multiple exist). Each bond encountered in the shortest path is used to calculate edge features.

  • Bond type: A one-hot vector of the bond type, “single”, “double”, “triple”, or “aromatic”.

  • Conjugated: A one-hot vector of whether this bond is conjugated or not.

  • Same ring: A one-hot vector of whether the atoms in the pair are in the same ring.

  • Ring Size and Aromaticity: One hot encoding of atoms in pair based on ring size and aromaticity.

  • Distance: One hot encoding of the distance between pair of atoms.

Examples

>>> from deepchem.feat import PagtnMolGraphFeaturizer
>>> smiles = ["C1CCC1", "C1=CC=CN=C1"]
>>> featurizer = PagtnMolGraphFeaturizer(max_length=5)
>>> out = featurizer.featurize(smiles)
>>> type(out[0])
<class 'deepchem.feat.graph_data.GraphData'>
>>> out[0].num_node_features
94
>>> out[0].num_edge_features
42

References

1

Chen, Barzilay, Jaakkola “Path-Augmented Graph Transformer Network” 10.26434/chemrxiv.8214422.

Note

This class requires RDKit to be installed.

__init__(max_length=5)[source]
Parameters

max_length (int) – Maximum distance up to which shortest paths must be considered. Paths shorter than max_length are padded and longer paths are truncated. Defaults to 5.

featurize(molecules, log_every_n=1000)numpy.ndarray[source]

Calculate features for molecules.

Parameters
  • molecules (rdkit.Chem.rdchem.Mol / SMILES string / iterable) – RDKit Mol, or SMILES string or iterable sequence of RDKit mols/SMILES strings.

  • log_every_n (int, default 1000) – Logging messages reported every log_every_n samples.

Returns

features – A numpy array containing a featurized representation of datapoints.

Return type

np.ndarray

Utilities

Here are some constants that are used by the graph convolutional featurizers for molecules.

class GraphConvConstants[source]

This class defines a collection of constants which are useful for graph convolutions on molecules.

possible_atom_list = ['C', 'N', 'O', 'S', 'F', 'P', 'Cl', 'Mg', 'Na', 'Br', 'Fe', 'Ca', 'Cu', 'Mc', 'Pd', 'Pb', 'K', 'I', 'Al', 'Ni', 'Mn'][source]

Allowed Numbers of Hydrogens

possible_numH_list = [0, 1, 2, 3, 4][source]

Allowed Valences for Atoms

possible_valence_list = [0, 1, 2, 3, 4, 5, 6][source]

Allowed Formal Charges for Atoms

possible_formal_charge_list = [-3, -2, -1, 0, 1, 2, 3][source]

This is a placeholder for documentation. These will be replaced with corresponding values of the rdkit HybridizationType

possible_hybridization_list = ['SP', 'SP2', 'SP3', 'SP3D', 'SP3D2'][source]

Allowed number of radical electrons.

possible_number_radical_e_list = [0, 1, 2][source]

Allowed types of Chirality

possible_chirality_list = ['R', 'S'][source]

The set of all values allowed.

reference_lists = [['C', 'N', 'O', 'S', 'F', 'P', 'Cl', 'Mg', 'Na', 'Br', 'Fe', 'Ca', 'Cu', 'Mc', 'Pd', 'Pb', 'K', 'I', 'Al', 'Ni', 'Mn'], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4, 5, 6], [-3, -2, -1, 0, 1, 2, 3], [0, 1, 2], ['SP', 'SP2', 'SP3', 'SP3D', 'SP3D2'], ['R', 'S']][source]

The number of different values that can be taken. See get_intervals()

intervals = [1, 6, 48, 384, 1536, 9216, 27648][source]

Possible stereochemistry. We use E-Z notation for stereochemistry https://en.wikipedia.org/wiki/E%E2%80%93Z_notation

possible_bond_stereo = ['STEREONONE', 'STEREOANY', 'STEREOZ', 'STEREOE'][source]

Number of different bond types not counting stereochemistry.

bond_fdim_base = 6[source]
__module__ = 'deepchem.feat.graph_features'[source]

There are a number of helper methods used by the graph convolutional classes which we document here.

one_of_k_encoding(x, allowable_set)[source]

Encodes an element of a provided set as a one-hot list of booleans.

Parameters
  • x (object) – Must be present in allowable_set.

  • allowable_set (list) – List of allowable quantities.

Example

>>> import deepchem as dc
>>> dc.feat.graph_features.one_of_k_encoding("a", ["a", "b", "c"])
[True, False, False]
Raises

ValueError

one_of_k_encoding_unk(x, allowable_set)[source]

Maps inputs not in the allowable set to the last element.

Unlike one_of_k_encoding, if x is not in allowable_set, this method pretends that x is the last element of allowable_set.

Parameters
  • x (object) – Value to encode. If x is not present in allowable_set, it is treated as the last element.

  • allowable_set (list) – List of allowable quantities.

Examples

>>> dc.feat.graph_features.one_of_k_encoding_unk("s", ["a", "b", "c"])
[False, False, True]
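
The difference between the strict and the lenient encoding can be sketched in plain Python. This is an illustrative re-creation based on the descriptions and doctest outputs above, not DeepChem's source; the function names one_hot and one_hot_unk are made up for this sketch.

```python
from typing import Any, List, Sequence


def one_hot(x: Any, allowable_set: Sequence[Any]) -> List[bool]:
    """Strict variant: raise ValueError if x is not an allowed value."""
    if x not in allowable_set:
        raise ValueError(f"input {x!r} not in allowable set {list(allowable_set)!r}")
    return [x == s for s in allowable_set]


def one_hot_unk(x: Any, allowable_set: Sequence[Any]) -> List[bool]:
    """Lenient variant: treat unknown values as the last allowed value."""
    if x not in allowable_set:
        x = allowable_set[-1]
    return [x == s for s in allowable_set]


strict = one_hot("a", ["a", "b", "c"])
lenient = one_hot_unk("s", ["a", "b", "c"])
```

Reserving the last slot for "anything else" keeps the vector length fixed even when rare values appear in the data.
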
get_intervals(l)[source]

For list of lists, gets the cumulative products of the lengths

Note that we add 1 to the lengths of all lists (to avoid an empty list propagating a 0).

Parameters

l (list of lists) – Lists whose lengths are used to compute the cumulative products.

Examples

>>> dc.feat.graph_features.get_intervals([[1], [1, 2], [1, 2, 3]])
[1, 3, 12]
>>> dc.feat.graph_features.get_intervals([[1], [], [1, 2], [1, 2, 3]])
[1, 1, 3, 12]
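
The arithmetic behind these outputs can be written out as a short sketch. The function name cumulative_intervals is hypothetical, and the logic is inferred from the description and doctest outputs above rather than copied from DeepChem: the first entry is 1, and each later entry multiplies the previous one by the length of the current list plus 1.

```python
from typing import List, Sequence


def cumulative_intervals(lists: Sequence[Sequence]) -> List[int]:
    """Cumulative products of (len + 1), starting from 1 for the first list."""
    if not lists:
        return []
    intervals = [1]
    for sub in lists[1:]:
        # Adding 1 keeps an empty list from turning every later product into 0.
        intervals.append((len(sub) + 1) * intervals[-1])
    return intervals


small = cumulative_intervals([[1], [], [1, 2], [1, 2, 3]])

# Placeholder lists with the same lengths as GraphConvConstants.reference_lists
# (21, 5, 7, 7, 3, 5 and 2 entries).
placeholder_lists = [list(range(n)) for n in (21, 5, 7, 7, 3, 5, 2)]
constants = cumulative_intervals(placeholder_lists)
```

Under these assumptions, the second call reproduces the values listed for GraphConvConstants.intervals.
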
safe_index(l, e)[source]

Gets the index of e in l, providing an index of len(l) if not found

Parameters
  • l (list) – List of values

  • e (object) – Object to look up in l

Examples

>>> dc.feat.graph_features.safe_index([1, 2, 3], 1)
0
>>> dc.feat.graph_features.safe_index([1, 2, 3], 7)
3
get_feature_list(atom)[source]

Returns a list of possible features for this atom.

Parameters

atom (RDKit.Chem.rdchem.Atom) – Atom to get features for

Examples

>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles("C")
>>> atom = mol.GetAtoms()[0]
>>> features = dc.feat.graph_features.get_feature_list(atom)
>>> type(features)
<class 'list'>
>>> len(features)
6

Note

This method requires RDKit to be installed.

Returns

features – List of length 6. The i-th value in this list provides the index of the atom in the corresponding feature value list. The 6 feature value lists for this function are [GraphConvConstants.possible_atom_list, GraphConvConstants.possible_numH_list, GraphConvConstants.possible_valence_list, GraphConvConstants.possible_formal_charge_list, GraphConvConstants.possible_number_radical_e_list, GraphConvConstants.possible_hybridization_list].

Return type

list

features_to_id(features, intervals)[source]

Convert list of features into index using spacings provided in intervals

Parameters
  • features (list) – List of features as returned by get_feature_list()

  • intervals (list) – List of intervals as returned by get_intervals()

Returns

id – The index in a feature vector that corresponds to the given set of features.

Return type

int

id_to_features(id, intervals)[source]

Given an index in a feature vector, return the original set of features.

Parameters
  • id (int) – The index in a feature vector that corresponds to a set of features.

  • intervals (list) – List of intervals as returned by get_intervals()

Returns

features – List of features as returned by get_feature_list()

Return type

list
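
The idea of packing a list of feature indices into one integer and unpacking it again can be sketched as follows. This is a generic mixed-radix round trip under stated assumptions, not DeepChem's implementation: the names encode and decode are hypothetical, and the library may apply additional offsets or conventions that this sketch omits.

```python
from typing import List, Sequence


def encode(features: Sequence[int], intervals: Sequence[int]) -> int:
    """Combine per-feature indices into one integer using the given spacings."""
    return sum(f * step for f, step in zip(features, intervals))


def decode(index: int, intervals: Sequence[int]) -> List[int]:
    """Recover per-feature indices by peeling off the largest spacing first.

    The round trip is only exact when each features[k] * intervals[k] is
    smaller than intervals[k + 1], so that the slots do not overlap.
    """
    features = [0] * len(intervals)
    for k in range(len(intervals) - 1, -1, -1):
        features[k], index = divmod(index, intervals[k])
    return features


# The first six spacings listed for GraphConvConstants.intervals.
intervals = [1, 6, 48, 384, 1536, 9216]
original = [0, 4, 1, 3, 0, 2]
packed = encode(original, intervals)
restored = decode(packed, intervals)
```

Here restored equals original because every index stays inside its slot; an index that overflows its slot would be decoded as a different feature combination.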

atom_to_id(atom)[source]

Return a unique id corresponding to the atom type

Parameters

atom (RDKit.Chem.rdchem.Atom) – Atom to convert to ids.

Returns

id – The index in a feature vector that corresponds to the features of the given atom.

Return type

int

This function helps compute distances between atoms from a given base atom.

find_distance(a1: Any, num_atoms: int, bond_adj_list, max_distance=7)numpy.ndarray[source]

Computes distances from provided atom.

Parameters
  • a1 (RDKit atom) – The source atom to compute distances from.

  • num_atoms (int) – The total number of atoms.

  • bond_adj_list (list of lists) – bond_adj_list[i] is a list of the atom indices that atom i shares a bond with. This list is symmetrical so if j in bond_adj_list[i] then i in bond_adj_list[j].

  • max_distance (int, optional (default 7)) – The max distance to search.

Returns

distances – Of shape (num_atoms, max_distance). Provides a one-hot encoding of the distances. That is, distances[i] is a one-hot encoding of the distance from a1 to atom i.

Return type

np.ndarray
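
A breadth-first search over the bond adjacency list is one way to produce such a one-hot distance table. The sketch below is an assumption-laden illustration, not the library code: it takes an integer atom index as the source, uses nested lists instead of a numpy array, and assumes that column d - 1 marks atoms exactly d bonds away, with the source atom itself left all-zero.

```python
from collections import deque
from typing import List


def one_hot_graph_distances(source: int, num_atoms: int,
                            bond_adj_list: List[List[int]],
                            max_distance: int = 7) -> List[List[int]]:
    """Return a num_atoms x max_distance one-hot table of bond-path lengths.

    Assumption for this sketch: column d - 1 marks atoms exactly d bonds from
    `source`; atoms farther away than max_distance keep an all-zero row.
    """
    table = [[0] * max_distance for _ in range(num_atoms)]
    dist = {source: 0}
    queue = deque([source])
    while queue:
        i = queue.popleft()
        if dist[i] == max_distance:
            continue  # do not expand beyond the search radius
        for j in bond_adj_list[i]:
            if j not in dist:
                dist[j] = dist[i] + 1
                table[j][dist[j] - 1] = 1
                queue.append(j)
    return table


# Propane ("CCC"): atom 1 is bonded to atoms 0 and 2.
adj = [[1], [0, 2], [1]]
rows = one_hot_graph_distances(0, 3, adj, max_distance=3)
```

For propane with atom 0 as the source, atom 1 is marked in the first column and atom 2 in the second.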

This function is important and computes per-atom feature vectors used by graph convolutional featurizers.

atom_features(atom, bool_id_feat=False, explicit_H=False, use_chirality=False)[source]

Helper method used to compute per-atom feature vectors.

Many featurization methods, such as ConvMolFeaturizer and WeaveFeaturizer, compute per-atom features. This method computes such features.

Parameters
  • atom (RDKit.Chem.rdchem.Atom) – Atom to compute features on.

  • bool_id_feat (bool, optional) – Return an array of unique identifiers corresponding to atom type.

  • explicit_H (bool, optional) – If true, model hydrogens explicitly

  • use_chirality (bool, optional) – If true, use chirality information.

Returns

features – An array of per-atom features.

Return type

np.ndarray

Examples

>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles('CCC')
>>> atom = mol.GetAtoms()[0]
>>> features = dc.feat.graph_features.atom_features(atom)
>>> type(features)
<class 'numpy.ndarray'>
>>> features.shape
(75,)

This function computes the bond features used by graph convolutional featurizers.

bond_features(bond, use_chirality=False)[source]

Helper method used to compute bond feature vectors.

Many featurization methods, such as WeaveFeaturizer, compute bond features. This method computes such features.

Parameters
  • bond (rdkit.Chem.rdchem.Bond) – Bond to compute features on.

  • use_chirality (bool, optional) – If true, use chirality information.

Note

This method requires RDKit to be installed.

Returns

bond_feats – Array of bond features. This is a 1-D array of length 6 if use_chirality is False else of length 10 with chirality encoded.

Return type

np.ndarray

Examples

>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles('CCC')
>>> bond = mol.GetBonds()[0]
>>> bond_features = dc.feat.graph_features.bond_features(bond)
>>> type(bond_features)
<class 'numpy.ndarray'>
>>> bond_features.shape
(6,)

This function computes atom-atom features (for atom pairs which may not have bonds between them.)

pair_features(mol: Any, bond_features_map: dict, bond_adj_list: List, bt_len: int = 6, graph_distance: bool = True, max_pair_distance: Optional[int] = None)numpy.ndarray[source]

Helper method used to compute atom pair feature vectors.

Many featurization methods, such as WeaveFeaturizer, compute atom pair features. Note that atom pair features may be computed for pairs of atoms that aren’t necessarily bonded to one another.

Parameters
  • mol (RDKit Mol) – Molecule to compute features on.

  • bond_features_map (dict) – Dictionary that maps pairs of atom ids (say (2, 3) for a bond between atoms 2 and 3) to the features for the bond between them.

  • bond_adj_list (list of lists) – bond_adj_list[i] is a list of the atom indices that atom i shares a bond with. This list is symmetrical so if j in bond_adj_list[i] then i in bond_adj_list[j].

  • bt_len (int, optional (default 6)) – The number of different bond types to consider.

  • graph_distance (bool, optional (default True)) – If true, use graph distance between atoms. Else use Euclidean distance, in which case the specified mol must have a conformer. Atomic positions will be retrieved by calling mol.GetConformer(0).

  • max_pair_distance (Optional[int], (default None)) – This value can be a positive integer or None. This parameter determines the maximum graph distance at which pair features are computed. For example, if max_pair_distance==2, then pair features are computed only for atoms at most graph distance 2 apart. If max_pair_distance is None, all pairs are considered (effectively infinite max_pair_distance)

Note

This method requires RDKit to be installed.

Returns

  • features (np.ndarray) – Of shape (N_edges, bt_len + max_distance + 1). This is the array of pairwise features for all atom pairs, where N_edges is the number of edges within max_pair_distance of one another in this molecule.

  • pair_edges (np.ndarray) – Of shape (2, num_pairs) where num_pairs is the total number of pairs within max_pair_distance of one another.
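
How max_pair_distance limits the set of atom pairs can be sketched with a breadth-first search. This is a simplified illustration, not the library code: the function names are hypothetical, and the sketch assumes that pairs are ordered and that each atom is paired with itself, which matches the 9 pair rows reported for the 3-atom molecule in the WeaveFeaturizer example. It also only pairs atoms connected by some bond path.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple


def bfs_distances(source: int, bond_adj_list: List[List[int]]) -> Dict[int, int]:
    """Number of bonds on the shortest path from source to each reachable atom."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        i = queue.popleft()
        for j in bond_adj_list[i]:
            if j not in dist:
                dist[j] = dist[i] + 1
                queue.append(j)
    return dist


def pair_edges(bond_adj_list: List[List[int]],
               max_pair_distance: Optional[int] = None
               ) -> Tuple[List[int], List[int]]:
    """Ordered atom pairs (i, j), self-pairs included, within the cutoff."""
    starts, ends = [], []
    for i in range(len(bond_adj_list)):
        dist = bfs_distances(i, bond_adj_list)
        for j in sorted(dist):
            if max_pair_distance is None or dist[j] <= max_pair_distance:
                starts.append(i)
                ends.append(j)
    return starts, ends


adj = [[1], [0, 2], [1]]  # propane, "CCC"
num_all_pairs = len(pair_edges(adj)[0])
num_near_pairs = len(pair_edges(adj, max_pair_distance=1)[0])
# bt_len + max_distance + 1, using the documented defaults 6 and 7.
feature_width = 6 + 7 + 1
```

With no cutoff, the three atoms give nine ordered pairs, and the feature width of 14 agrees with the (9, 14) shape shown in the WeaveFeaturizer example.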

MACCSKeysFingerprint

class MACCSKeysFingerprint[source]

MACCS Keys Fingerprint.

The MACCS (Molecular ACCess System) keys are one of the most commonly used structural keys. See [1]_ and [2]_ for details.

Examples

>>> import deepchem as dc
>>> smiles = 'CC(=O)OC1=CC=CC=C1C(=O)O'
>>> featurizer = dc.feat.MACCSKeysFingerprint()
>>> features = featurizer.featurize([smiles])
>>> type(features[0])
<class 'numpy.ndarray'>
>>> features[0].shape
(167,)

References

1

Durant, Joseph L., et al. “Reoptimization of MDL keys for use in drug discovery.” Journal of chemical information and computer sciences 42.6 (2002): 1273-1280.

2

https://github.com/rdkit/rdkit/blob/master/rdkit/Chem/MACCSkeys.py

Note

This class requires RDKit to be installed.

__init__()[source]

Initialize this featurizer.

MATFeaturizer

class MATFeaturizer(one_hot_formal_charge: bool = True)[source]

This class is a featurizer for the Molecule Attention Transformer [1]_. The featurizer accepts an RDKit Molecule, and a boolean (one_hot_formal_charge) as arguments. The returned value is a numpy array which consists of molecular graph descriptions:

  • Node Features

  • Adjacency Matrix

  • Distance Matrix

References

1

Lukasz Maziarka et al. “Molecule Attention Transformer.” https://arxiv.org/abs/2002.08264

Examples

>>> import deepchem as dc
>>> feat = dc.feat.MATFeaturizer()
>>> out = feat.featurize("CCC")

Note

This class requires RDKit to be installed.

__init__(one_hot_formal_charge: bool = True)[source]
Parameters

one_hot_formal_charge (bool, default True) – If True, formal charges on atoms are one-hot encoded.

atom_features(atom: Any)numpy.ndarray[source]

DeepChem already contains an atom_features function; however, we define a new one here to handle features specific to MAT. Since MAT needs additional features such as Atom GetNeighbors and IsInRing, and requires only a fraction of the features that the DeepChem atom_features function computes, a custom function speeds up computation.

Parameters

atom (RDKitAtom) – RDKit Atom object.

Returns

Atom_features – Numpy array containing atom features.

Return type

ndarray

featurize(molecules, log_every_n=1000)numpy.ndarray[source]

Calculate features for molecules.

Parameters
  • molecules (rdkit.Chem.rdchem.Mol / SMILES string / iterable) – RDKit Mol, or SMILES string or iterable sequence of RDKit mols/SMILES strings.

  • log_every_n (int, default 1000) – Logging messages reported every log_every_n samples.

Returns

features – A numpy array containing a featurized representation of datapoints.

Return type

np.ndarray

CircularFingerprint

class CircularFingerprint(radius: int = 2, size: int = 2048, chiral: bool = False, bonds: bool = True, features: bool = False, sparse: bool = False, smiles: bool = False)[source]

Circular (Morgan) fingerprints.

Extended Connectivity Circular Fingerprints compute a bag-of-words style representation of a molecule by breaking it into local neighborhoods and hashing into a bit vector of the specified size. It is used specifically for structure-activity modelling. See [1]_ for more details.
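
The "hash into a bit vector of the specified size" step can be illustrated with a toy sketch. This is not the Morgan/ECFP algorithm and not how RDKit hashes fragments: the fragment labels are made-up strings standing in for local atom neighborhoods, and fold_into_bits is a hypothetical helper that uses SHA-256 only to get a deterministic hash.

```python
import hashlib
from typing import Iterable, List


def fold_into_bits(fragment_ids: Iterable[str], size: int = 2048) -> List[int]:
    """Hash each fragment identifier to a position in a fixed-length bit vector."""
    bits = [0] * size
    for frag in fragment_ids:
        digest = hashlib.sha256(frag.encode("utf-8")).digest()
        position = int.from_bytes(digest[:8], "big") % size
        bits[position] = 1  # repeated or colliding fragments set the same bit once
    return bits


# Made-up fragment labels; a real fingerprint derives identifiers from the
# atoms within `radius` bonds of each atom.
fragments = ["C", "C-C", "C-O", "C-C"]
vector = fold_into_bits(fragments, size=64)
```

Because the vector length is fixed, distinct fragments can collide on the same bit; a larger size reduces the chance of collisions at the cost of a wider feature vector.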

References

1

Rogers, David, and Mathew Hahn. “Extended-connectivity fingerprints.” Journal of chemical information and modeling 50.5 (2010): 742-754.

Note

This class requires RDKit to be installed.

Examples

>>> import deepchem as dc
>>> from rdkit import Chem
>>> smiles = ['C1=CC=CC=C1']
>>> # Example 1: (size = 2048, radius = 4)
>>> featurizer = dc.feat.CircularFingerprint(size=2048, radius=4)
>>> features = featurizer.featurize(smiles)
>>> type(features[0])
<class 'numpy.ndarray'>
>>> features[0].shape
(2048,)
>>> # Example 2: (size = 2048, radius = 8, sparse = True, smiles = True)
>>> featurizer = dc.feat.CircularFingerprint(size=2048, radius=8,
...                                          sparse=True, smiles=True)
>>> features = featurizer.featurize(smiles)
>>> type(features[0]) # dict containing fingerprints
<class 'dict'>
__init__(radius: int = 2, size: int = 2048, chiral: bool = False, bonds: bool = True, features: bool = False, sparse: bool = False, smiles: bool = False)[source]
Parameters
  • radius (int, optional (default 2)) – Fingerprint radius.

  • size (int, optional (default 2048)) – Length of generated bit vector.

  • chiral (bool, optional (default False)) – Whether to consider chirality in fingerprint generation.

  • bonds (bool, optional (default True)) – Whether to consider bond order in fingerprint generation.

  • features (bool, optional (default False)) – Whether to use feature information instead of atom information; see RDKit docs for more info.

  • sparse (bool, optional (default False)) – Whether to return a dict for each molecule containing the sparse fingerprint.

  • smiles (bool, optional (default False)) – Whether to calculate SMILES strings for fragment IDs (only applicable when calculating sparse fingerprints).

featurize(molecules, log_every_n=1000)numpy.ndarray[source]

Calculate features for molecules.

Parameters
  • molecules (rdkit.Chem.rdchem.Mol / SMILES string / iterable) – RDKit Mol, or SMILES string or iterable sequence of RDKit mols/SMILES strings.

  • log_every_n (int, default 1000) – Logging messages reported every log_every_n samples.

Returns

features – A numpy array containing a featurized representation of datapoints.

Return type

np.ndarray

PubChemFingerprint

class PubChemFingerprint[source]

PubChem Fingerprint.

The PubChem fingerprint is an 881-bit structural key that PubChem uses for similarity searching. See [1]_ for details.

References

1

ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.pdf

Note

This class requires RDKit and PubChemPy to be installed. PubChemPy uses a REST API to retrieve the fingerprint, so internet access is required.

Examples

>>> import deepchem as dc
>>> smiles = ['CCC']
>>> featurizer = dc.feat.PubChemFingerprint()
>>> features = featurizer.featurize(smiles)
>>> type(features[0])
<class 'numpy.ndarray'>
>>> features[0].shape
(881,)
__init__()[source]

Initialize this featurizer.

Mol2VecFingerprint

class Mol2VecFingerprint(pretrain_model_path: Optional[str] = None, radius: int = 1, unseen: str = 'UNK')[source]

Mol2Vec fingerprints.

This class converts molecules to vector representations using Mol2Vec. Mol2Vec is an unsupervised machine learning approach for learning vector representations of molecular substructures. The algorithm is based on Word2Vec, one of the most popular techniques in NLP for learning word embeddings with a neural network. See [1]_ for details.

Mol2Vec requires a pretrained model, so by default we use the model available in the mol2vec GitHub repository [2]_. The default model was trained on 20 million compounds downloaded from ZINC using the following parameters:

  • radius 1

  • UNK to replace all identifiers that appear less than 4 times

  • skip-gram and window size of 10

  • embeddings size 300

References

1

Jaeger, Sabrina, Simone Fulle, and Samo Turk. “Mol2vec: unsupervised machine learning approach with chemical intuition.” Journal of chemical information and modeling 58.1 (2018): 27-35.

2

https://github.com/samoturk/mol2vec/

Note

This class requires mol2vec to be installed.

Examples

>>> import deepchem as dc
>>> from rdkit import Chem
>>> smiles = ['CCC']
>>> featurizer = dc.feat.Mol2VecFingerprint()
>>> features = featurizer.featurize(smiles)
>>> type(features)
<class 'numpy.ndarray'>
>>> features[0].shape
(300,)
__init__(pretrain_model_path: Optional[str] = None, radius: int = 1, unseen: str = 'UNK')[source]
Parameters
  • pretrain_model_path (str, optional) – The path to the pretrained model. If this value is None, we use the model available in the GitHub repository (https://github.com/samoturk/mol2vec/tree/master/examples/models). The model was trained on 20 million compounds downloaded from ZINC.

  • radius (int, optional (default 1)) – The fingerprint radius. The default value is the one used to train the model available in the GitHub repository.

  • unseen (str, optional (default 'UNK')) – The string used to replace uncommon words/identifiers during training.

sentences2vec(sentences: list, model, unseen=None)numpy.ndarray[source]

Generate a vector for each sentence (list) in a list of sentences. Each vector is simply the sum of the vectors of the individual words.

Parameters
Returns

Return type

np.array
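The summation described for sentences2vec can be sketched without a trained model. In this sketch a plain dictionary stands in for the Word2Vec model, and the words and vectors are hypothetical. Skipping unknown words when `unseen` is None is an assumption made here, not something this page states.

```python
def sentences_to_vectors(sentences, embeddings, unseen=None):
    """Sum word vectors per sentence.

    `embeddings` is a hypothetical dict mapping words to equal-length lists,
    used here in place of a trained Word2Vec model.
    """
    dim = len(next(iter(embeddings.values())))
    vectors = []
    for sentence in sentences:
        total = [0.0] * dim
        for word in sentence:
            if word in embeddings:
                vec = embeddings[word]
            elif unseen is not None:
                # Fall back to the vector of the "unseen" token.
                vec = embeddings[unseen]
            else:
                # Assumption: unknown words are skipped when unseen is None.
                continue
            total = [t + v for t, v in zip(total, vec)]
        vectors.append(total)
    return vectors


toy_embeddings = {"a": [1.0, 0.0], "b": [0.0, 2.0], "UNK": [0.5, 0.5]}
toy_sentences = [["a", "b"], ["a", "zzz"]]
with_unk = sentences_to_vectors(toy_sentences, toy_embeddings, unseen="UNK")
without_unk = sentences_to_vectors(toy_sentences, toy_embeddings)
print(with_unk, without_unk)
```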

featurize(molecules, log_every_n=1000)numpy.ndarray[source]

Calculate features for molecules.

Parameters
  • molecules (rdkit.Chem.rdchem.Mol / SMILES string / iterable) – RDKit Mol, or SMILES string or iterable sequence of RDKit mols/SMILES strings.

  • log_every_n (int, default 1000) – Logging messages reported every log_every_n samples.

Returns

features – A numpy array containing a featurized representation of datapoints.

Return type

np.ndarray

RDKitDescriptors

class RDKitDescriptors(use_fragment=True, ipc_avg=True)[source]

RDKit descriptors.

This class computes a list of chemical descriptors, such as molecular weight, number of valence electrons, and maximum and minimum partial charge, using RDKit.

descriptors[source]

List of RDKit descriptor names used in this class.

Type

List[str]

Note

This class requires RDKit to be installed.

Examples

>>> import deepchem as dc
>>> smiles = ['CC(=O)OC1=CC=CC=C1C(=O)O']
>>> featurizer = dc.feat.RDKitDescriptors()
>>> features = featurizer.featurize(smiles)
>>> type(features[0])
<class 'numpy.ndarray'>
>>> features[0].shape
(208,)
__init__(use_fragment=True, ipc_avg=True)[source]

Initialize this featurizer.

Parameters
  • use_fragment (bool, optional (default True)) – If True, the return value includes the fragment binary descriptors like ‘fr_XXX’.

  • ipc_avg (bool, optional (default True)) – If True, the IPC descriptor is calculated with the avg=True option. See this issue for details: https://github.com/rdkit/rdkit/issues/1527.

featurize(molecules, log_every_n=1000)numpy.ndarray[source]

Calculate features for molecules.

Parameters
  • molecules (rdkit.Chem.rdchem.Mol / SMILES string / iterable) – RDKit Mol, or SMILES string or iterable sequence of RDKit mols/SMILES strings.

  • log_every_n (int, default 1000) – Logging messages reported every log_every_n samples.

Returns

features – A numpy array containing a featurized representation of datapoints.

Return type

np.ndarray

MordredDescriptors

class MordredDescriptors(ignore_3D: bool = True)[source]

Mordred descriptors.

This class computes a list of chemical descriptors using Mordred. See [1]_ and [2]_ for details on all descriptors.

descriptors[source]

List of Mordred descriptor names used in this class.

Type

List[str]

References

1

Moriwaki, Hirotomo, et al. “Mordred: a molecular descriptor calculator.” Journal of cheminformatics 10.1 (2018): 4.

2

http://mordred-descriptor.github.io/documentation/master/descriptors.html

Note

This class requires Mordred to be installed.

Examples

>>> import deepchem as dc
>>> smiles = ['CC(=O)OC1=CC=CC=C1C(=O)O']
>>> featurizer = dc.feat.MordredDescriptors(ignore_3D=True)
>>> features = featurizer.featurize(smiles)
>>> type(features[0])
<class 'numpy.ndarray'>
>>> features[0].shape
(1613,)
__init__(ignore_3D: bool = True)[source]
Parameters

ignore_3D (bool, optional (default True)) – Whether to ignore 3D information. If True, descriptors that require 3D coordinates are not computed.

featurize(molecules, log_every_n=1000)numpy.ndarray[source]

Calculate features for molecules.

Parameters
  • molecules (rdkit.Chem.rdchem.Mol / SMILES string / iterable) – RDKit Mol, or SMILES string or iterable sequence of RDKit mols/SMILES strings.

  • log_every_n (int, default 1000) – Logging messages reported every log_every_n samples.

Returns

features – A numpy array containing a featurized representation of datapoints.

Return type

np.ndarray

CoulombMatrix

class CoulombMatrix(max_atoms: int, remove_hydrogens: bool = False, randomize: bool = False, upper_tri: bool = False, n_samples: int = 1, seed: Optional[int] = None)[source]

Calculate Coulomb matrices for molecules.

Coulomb matrices provide a representation of the electronic structure of a molecule. For a molecule with N atoms, the Coulomb matrix is an N x N matrix in which each element gives the strength of the electrostatic interaction between two atoms. The method is described in more detail in [1]_.
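As a rough sketch of the definition, the code below uses the form commonly given in the literature: diagonal entries 0.5 * Z_i^2.4 and off-diagonal entries Z_i * Z_j / |R_i - R_j|. The water coordinates are hypothetical, assumed to be in bohr, and the zero padding up to `max_atoms` is an assumption about how a fixed-size output could be produced. This is not DeepChem's implementation.

```python
import math


def coulomb_matrix(charges, coords, max_atoms):
    """Build a zero-padded Coulomb matrix as nested lists (illustrative only)."""
    n = len(charges)
    if n > max_atoms:
        raise ValueError("molecule has more atoms than max_atoms")
    matrix = [[0.0] * max_atoms for _ in range(max_atoms)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal term: fitted self-interaction of atom i.
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # Off-diagonal term: Coulomb repulsion between nuclei i and j.
                distance = math.dist(coords[i], coords[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


# Hypothetical water geometry (O, H, H), coordinates assumed to be in bohr.
charges = [8, 1, 1]
coords = [(0.0, 0.0, 0.0), (1.8, 0.0, 0.0), (-0.45, 1.74, 0.0)]
cm = coulomb_matrix(charges, coords, max_atoms=4)
for row in cm:
    print([round(value, 3) for value in row])
```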

Examples

>>> import deepchem as dc
>>> featurizers = dc.feat.CoulombMatrix(max_atoms=23)
>>> input_file = 'deepchem/feat/tests/data/water.sdf' # really backed by water.sdf.csv
>>> tasks = ["atomization_energy"]
>>> loader = dc.data.SDFLoader(tasks, featurizer=featurizers)
>>> dataset = loader.create_dataset(input_file)

References

1

Montavon, Grégoire, et al. “Learning invariant representations of molecules for atomization energy prediction.” Advances in neural information processing systems. 2012.

Note

This class requires RDKit to be installed.

__init__(max_atoms: int, remove_hydrogens: bool = False, randomize: bool = False, upper_tri: bool = False, n_samples: int = 1, seed: Optional[int] = None)[source]

Initialize this featurizer.

Parameters
  • max_atoms (int) – The maximum number of atoms expected for molecules this featurizer will process.

  • remove_hydrogens (bool, optional (default False)) – If True, remove hydrogens before processing them.

  • randomize (bool, optional (default False)) – If True, use method randomize_coulomb_matrix to randomize Coulomb matrices.

  • upper_tri (bool, optional (default False)) – Generate only the upper triangular part of Coulomb matrices.

  • n_samples (int, optional (default 1)) – If randomize is set to True, the number of random samples to draw.

  • seed (int, optional (default None)) – Random seed to use.

coulomb_matrix(mol: Any)numpy.ndarray[source]

Generate Coulomb matrices for each conformer of the given molecule.

Parameters

mol (rdkit.Chem.rdchem.Mol) – RDKit Mol object

Returns

The coulomb matrices of the given molecule

Return type

np.ndarray

randomize_coulomb_matrix(m: numpy.ndarray)List[numpy.ndarray][source]

Randomize a Coulomb matrix as described in [1]_:

  1. Compute row norms for M in a vector row_norms.

  2. Sample a zero-mean unit-variance noise vector e with dimension equal to row_norms.

  3. Permute the rows and columns of M with the permutation that sorts row_norms + e.
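The three steps can be sketched in plain Python as follows. The function name matches the method only for readability. The `seed` argument and the use of nested lists instead of NumPy arrays are choices made for this sketch, not DeepChem's implementation.

```python
import math
import random


def randomize_coulomb_matrix(m, seed=None):
    """Return one randomly permuted copy of a square matrix m (sketch)."""
    rng = random.Random(seed)
    # Step 1: compute the norm of each row.
    row_norms = [math.sqrt(sum(value * value for value in row)) for row in m]
    # Step 2: draw zero-mean, unit-variance noise, one value per row.
    noise = [rng.gauss(0.0, 1.0) for _ in row_norms]
    # Step 3: apply the permutation that sorts row_norms + noise
    # to both the rows and the columns.
    order = sorted(range(len(m)), key=lambda i: row_norms[i] + noise[i])
    return [[m[i][j] for j in order] for i in order]


toy = [[36.9, 4.4, 4.4],
       [4.4, 0.5, 0.3],
       [4.4, 0.3, 0.5]]
shuffled = randomize_coulomb_matrix(toy, seed=0)
for row in shuffled:
    print(row)
```

Because the same permutation is applied to rows and columns, the result is still a valid Coulomb matrix of the same molecule with its atoms listed in a different order.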

Parameters

m (np.ndarray) – Coulomb matrix.

Returns

List of randomized Coulomb matrices

Return type

List[np.ndarray]

References

1

Montavon et al., New Journal of Physics, 15, (2013), 095003

static get_interatomic_distances(conf: Any)numpy.ndarray[source]

Get interatomic distances for atoms in a molecular conformer.

Parameters

conf (rdkit.Chem.rdchem.Conformer) – Molecule conformer.

Returns

The distance matrix for all atoms in a molecule

Return type

np.ndarray

featurize(molecules, log_every_n=1000)numpy.ndarray[source]

Calculate features for molecules.

Parameters
  • molecules (rdkit.Chem.rdchem.Mol / SMILES string / iterable) – RDKit Mol, or SMILES string or iterable sequence of RDKit mols/SMILES strings.

  • log_every_n (int, default 1000) – Logging messages reported every log_every_n samples.

Returns

features – A numpy array containing a featurized representation of datapoints.

Return type

np.ndarray

CoulombMatrixEig

class CoulombMatrixEig(max_atoms: int, remove_hydrogens: bool = False, randomize: bool = False, n_samples: int = 1, seed: Optional[int] = None)[source]

Calculate the eigenvalues of Coulomb matrices for molecules.

This featurizer computes the eigenvalues of the Coulomb matrices for provided molecules. Coulomb matrices are described in [1]_.
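A minimal NumPy sketch of the idea follows. Sorting the eigenvalues by decreasing absolute value and zero-padding them to `max_atoms` are assumptions made here to obtain a fixed-length vector. `coulomb_eigenvalues` is a hypothetical helper, not a DeepChem function.

```python
import numpy as np


def coulomb_eigenvalues(matrix, max_atoms):
    """Eigenvalue spectrum of a symmetric matrix, padded to max_atoms (sketch)."""
    matrix = np.asarray(matrix, dtype=float)
    # eigvalsh is appropriate because Coulomb matrices are symmetric.
    eigenvalues = np.linalg.eigvalsh(matrix)
    # Assumption: order by decreasing absolute value.
    order = np.argsort(np.abs(eigenvalues))[::-1]
    eigenvalues = eigenvalues[order]
    # Assumption: pad with zeros so every molecule yields the same length.
    return np.pad(eigenvalues, (0, max_atoms - eigenvalues.shape[0]),
                  mode="constant")


toy = [[2.0, 1.0],
       [1.0, 2.0]]
spectrum = coulomb_eigenvalues(toy, max_atoms=4)
print(spectrum)
```

One motivation for using eigenvalues is that they do not change when the same permutation is applied to the rows and columns of the matrix, so the feature vector does not depend on the order in which atoms are listed.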

Examples

>>> import deepchem as dc
>>> featurizers = dc.feat.CoulombMatrixEig(max_atoms=23)
>>> input_file = 'deepchem/feat/tests/data/water.sdf' # really backed by water.sdf.csv
>>> tasks = ["atomization_energy"]
>>> loader = dc.data.SDFLoader(tasks, featurizer=featurizers)
>>> dataset = loader.create_dataset(input_file)

References

1

Montavon, Grégoire, et al. “Learning invariant representations of molecules for atomization energy prediction.” Advances in neural information processing systems. 2012.

__init__(max_atoms: int, remove_hydrogens: bool = False, randomize: bool = False, n_samples: int = 1, seed: Optional[int] = None)[source]

Initialize this featurizer.

Parameters
  • max_atoms (int) – The maximum number of atoms expected for molecules this featurizer will process.

  • remove_hydrogens (bool, optional (default False)) – If True, remove hydrogens before processing them.

  • randomize (bool, optional (default False)) – If True, use method randomize_coulomb_matrix to randomize Coulomb matrices.

  • n_samples (int, optional (default 1)) – If randomize is set to True, the number of random samples to draw.

  • seed (int, optional (default None)) – Random seed to use.

coulomb_matrix(mol: Any)numpy.ndarray[source]

Generate Coulomb matrices for each conformer of the given molecule.

Parameters

mol (rdkit.Chem.rdchem.Mol) – RDKit Mol object

Returns

The coulomb matrices of the given molecule

Return type

np.ndarray

featurize(molecules, log_every_n=1000)numpy.ndarray[source]

Calculate features for molecules.

Parameters
  • molecules (rdkit.Chem.rdchem.Mol / SMILES string / iterable) – RDKit Mol, or SMILES string or iterable sequence of RDKit mols/SMILES strings.

  • log_every_n (int, default 1000) – Logging messages reported every log_every_n samples.

Returns

features – A numpy array containing a featurized representation of datapoints.

Return type

np.ndarray

static get_interatomic_distances(conf: Any)numpy.ndarray[source]

Get interatomic distances for atoms in a molecular conformer.

Parameters

conf (rdkit.Chem.rdchem.Conformer) – Molecule conformer.

Returns

The distance matrix for all atoms in a molecule

Return type

np.ndarray

randomize_coulomb_matrix(m: numpy.ndarray)List[numpy.ndarray][source]

Randomize a Coulomb matrix as described in [1]_:

  1. Compute row norms for M in a vector row_norms.

  2. Sample a zero-mean unit-variance noise vector e with dimension equal to row_norms.

  3. Permute the rows and columns of M with the permutation that sorts row_norms + e.

Parameters

m (np.ndarray) – Coulomb matrix.

Returns

List of randomized Coulomb matrices

Return type

List[np.ndarray]

References

1

Montavon et al., New Journal of Physics, 15, (2013), 095003

AtomicCoordinates

class AtomicCoordinates(use_bohr: bool = False)[source]

Calculate atomic coordinates.

Examples

>>> import deepchem as dc
>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles('C1C=CC=CC=1')
>>> n_atoms = len(mol.GetAtoms())
>>> n_atoms
6
>>> featurizer = dc.feat.AtomicCoordinates(use_bohr=False)
>>> features = featurizer.featurize([mol])
>>> type(features[0])
<class 'numpy.ndarray'>
>>> features[0].shape # (n_atoms, 3)
(6, 3)

Note

This class requires RDKit to be installed.

__init__(use_bohr: bool = False)[source]
Parameters

use_bohr (bool, optional (default False)) – Whether to use bohr instead of angstrom as the coordinate unit.
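The unit choice amounts to a single scale factor. The sketch below converts hypothetical coordinates from angstrom to bohr using the CODATA value of the Bohr radius. The exact constant DeepChem applies is not stated on this page, so treat the number as an assumption.

```python
# CODATA 2018 Bohr radius expressed in angstrom (assumed constant).
ANGSTROM_PER_BOHR = 0.529177210903


def angstrom_to_bohr(coords):
    """Convert a list of (x, y, z) tuples from angstrom to bohr."""
    return [tuple(value / ANGSTROM_PER_BOHR for value in xyz) for xyz in coords]


# Hypothetical coordinates in angstrom.
coords_angstrom = [(0.0, 0.0, 0.0), (0.529177210903, 0.0, 0.0)]
coords_bohr = angstrom_to_bohr(coords_angstrom)
print(coords_bohr)
```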

featurize(molecules, log_every_n=1000)numpy.ndarray[source]

Calculate features for molecules.

Parameters
  • molecules (rdkit.Chem.rdchem.Mol / SMILES string / iterable) – RDKit Mol, or SMILES string or iterable sequence of RDKit mols/SMILES strings.

  • log_every_n (int, default 1000) – Logging messages reported every log_every_n samples.

Returns

features – A numpy array containing a featurized representation of datapoints.

Return type

np.ndarray

BPSymmetryFunctionInput

class BPSymmetryFunctionInput(max_atoms: int)[source]

Calculate symmetry function for each atom in the molecules.

This method is described in [1]_.

Examples

>>> import deepchem as dc
>>> smiles = ['C1C=CC=CC=1']
>>> featurizer = dc.feat.BPSymmetryFunctionInput(max_atoms=10)
>>> features = featurizer.featurize(smiles)
>>> type(features[0])
<class 'numpy.ndarray'>
>>> features[0].shape  # (max_atoms, 4)
(10, 4)

References

1

Behler, Jörg, and Michele Parrinello. “Generalized neural-network representation of high-dimensional potential-energy surfaces.” Physical review letters 98.14 (2007): 146401.

Note

This class requires RDKit to be installed.

__init__(max_atoms: int)[source]

Initialize this featurizer.

Parameters

max_atoms (int) – The maximum number of atoms expected for molecules this featurizer will process.

featurize(molecules, log_every_n=1000)numpy.ndarray[source]

Calculate features for molecules.

Parameters
  • molecules (rdkit.Chem.rdchem.Mol / SMILES string / iterable) – RDKit Mol, or SMILES string or iterable sequence of RDKit mols/SMILES strings.

  • log_every_n (int, default 1000) – Logging messages reported every log_every_n samples.

Returns

features – A numpy array containing a featurized representation of datapoints.

Return type

np.ndarray

SmilesToSeq

class SmilesToSeq(char_to_idx: Dict[str, int], max_len: int = 250, pad_len: int = 10)[source]

The SmilesToSeq featurizer takes a SMILES string and turns it into a sequence of indices. Details are taken from [1]_.

SMILES strings shorter than the specified maximum length (max_len) are padded using the PAD token, while those longer than the maximum length are not considered. Following the paper, there is also the option to add extra padding (pad_len) on both sides of the string after length normalization. Using a character-to-index (char_to_idx) mapping, the SMILES characters are turned into indices, and the resulting sequence of indices serves as the input for an embedding layer.
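The padding and index lookup described above can be sketched as follows. The token names `<pad>` and `<unk>`, the fallback to an unknown-character index, and the toy vocabulary are all assumptions for illustration. They are not necessarily the tokens DeepChem uses.

```python
PAD_TOKEN = "<pad>"  # hypothetical padding token
UNK_TOKEN = "<unk>"  # hypothetical token for characters outside the vocabulary


def smiles_to_seq(smiles, char_to_idx, max_len=250, pad_len=10):
    """Turn a SMILES string into a padded list of indices (sketch)."""
    chars = list(smiles)
    if len(chars) > max_len:
        # Strings longer than max_len are not considered.
        return None
    # Length normalization: pad on the right up to max_len.
    chars += [PAD_TOKEN] * (max_len - len(chars))
    # Extra padding on both sides.
    chars = [PAD_TOKEN] * pad_len + chars + [PAD_TOKEN] * pad_len
    unknown = char_to_idx[UNK_TOKEN]
    return [char_to_idx.get(char, unknown) for char in chars]


toy_vocab = {PAD_TOKEN: 0, UNK_TOKEN: 1, "C": 2, "O": 3, "=": 4}
seq = smiles_to_seq("C=O", toy_vocab, max_len=5, pad_len=2)
print(seq)
```

With these settings every accepted string produces a sequence of length max_len + 2 * pad_len, which is what a fixed-size embedding input needs.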

References

1

Goh, Garrett B., et al. “Using rule-based labels for weak supervised learning: a ChemNet for transferable chemical property prediction.” Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.

Note

This class requires RDKit to be installed.

__init__(char_to_idx: Dict[str, int], max_len: int = 250, pad_len: int = 10)[source]

Initialize this class.

Parameters
  • char_to_idx (Dict) – Dictionary containing character to index mappings for unique characters

  • max_len (int, default 250) – Maximum allowed length of the SMILES string.

  • pad_len (int, default 10) – Amount of padding to add on either side of the SMILES seq

to_seq(smile: List[str])numpy.ndarray[source]

Turns list of smiles characters into array of indices

remove_pad(characters: List[str])List[str][source]

Removes PAD_TOKEN from the character list.

smiles_from_seq(seq: List[int])str[source]

Reconstructs SMILES string from sequence.

featurize(molecules, log_every_n=1000)numpy.ndarray[source]

Calculate features for molecules.

Parameters
  • molecules (rdkit.Chem.rdchem.Mol / SMILES string / iterable) – RDKit Mol, or SMILES string or iterable sequence of RDKit mols/SMILES strings.

  • log_every_n (int, default 1000) – Logging messages reported every log_every_n samples.

Returns

features – A numpy array containing a featurized representation of datapoints.

Return type

np.ndarray

SmilesToImage

class SmilesToImage(img_size: int = 80, res: float = 0.5, max_len: int = 250, img_spec: str = 'std')[source]

Convert SMILES string to an image.

The SmilesToImage featurizer takes a SMILES string and turns it into an image. Details are taken from [1]_.

The default image size is 80 x 80. Two image modes are currently supported: std and engd. std is the grayscale specification, with atomic numbers as pixel values at atom positions and a constant value of 2 at bond positions. engd is a 4-channel specification, which uses atom properties such as hybridization, valency, and charges in addition to atomic number. Bond type is also used for the bonds.

The coordinates of all atoms are computed, and lines are drawn between atoms to indicate bonds. For the respective channels, the atom and bond positions are set to the property values as mentioned in the paper.
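A toy rasterization in the spirit of the std mode is sketched below. The 2D coordinates are hypothetical (in practice they would come from a toolkit such as RDKit), and the centering, rounding, and line-sampling choices are assumptions of this sketch rather than DeepChem's implementation.

```python
import math


def rasterize(atoms, bonds, img_size=80, res=0.5):
    """Draw atoms and bonds on a square grid (sketch).

    atoms: list of (atomic_number, x, y) with coordinates in angstrom.
    bonds: list of (i, j) index pairs into atoms.
    """
    image = [[0] * img_size for _ in range(img_size)]
    center = img_size // 2

    def to_pixel(x, y):
        # One pixel covers `res` angstrom; the origin maps to the image center.
        return center + int(round(x / res)), center + int(round(y / res))

    def inside(row, col):
        return 0 <= row < img_size and 0 <= col < img_size

    # Draw bonds first so atom values overwrite the bond end points.
    for i, j in bonds:
        (_, x1, y1), (_, x2, y2) = atoms[i], atoms[j]
        steps = max(2, int(math.hypot(x2 - x1, y2 - y1) / res) * 2 + 1)
        for k in range(steps):
            t = k / (steps - 1)
            row, col = to_pixel(x1 + t * (x2 - x1), y1 + t * (y2 - y1))
            if inside(row, col):
                image[row][col] = 2  # constant value for bond positions
    for atomic_number, x, y in atoms:
        row, col = to_pixel(x, y)
        if inside(row, col):
            image[row][col] = atomic_number
    return image


# Hypothetical carbon-oxygen pair, 1.5 angstrom apart.
toy_atoms = [(6, 0.0, 0.0), (8, 1.5, 0.0)]
toy_bonds = [(0, 1)]
image = rasterize(toy_atoms, toy_bonds, img_size=80, res=0.5)
print(image[40][40], image[41][40], image[42][40], image[43][40])
```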

Examples

>>> import deepchem as dc
>>> smiles = ['CC(=O)OC1=CC=CC=C1C(=O)O']
>>> featurizer = dc.feat.SmilesToImage(img_size=80, img_spec='std')
>>> images = featurizer.featurize(smiles)
>>> type(images[0])
<class 'numpy.ndarray'>
>>> images[0].shape # (img_size, img_size, 1)
(80, 80, 1)

References

1

Goh, Garrett B., et al. “Using rule-based labels for weak supervised learning: a ChemNet for transferable chemical property prediction.” Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.

Note

This class requires RDKit to be installed.

__init__(img_size: int = 80, res: float = 0.5, max_len: int = 250, img_spec: str = 'std')[source]
Parameters
  • img_size (int, default 80) – Size of the image tensor

  • res (float, default 0.5) – Resolution of each pixel in angstroms

  • max_len (int, default 250) – Maximum allowed length of SMILES string

  • img_spec (str, default std) – Indicates the channel organization of the image tensor

featurize(molecules, log_every_n=1000)numpy.ndarray[source]

Calculate features for molecules.

Parameters
  • molecules (rdkit.Chem.rdchem.Mol / SMILES string / iterable) – RDKit Mol, or SMILES string or iterable sequence of RDKit mols/SMILES strings.

  • log_every_n (int, default 1000) – Logging messages reported every log_every_n samples.

Returns

features – A numpy array containing a featurized representation of datapoints.

Return type

np.ndarray

OneHotFeaturizer

class OneHotFeaturizer(charset: List[str] = ['#', ')', '(', '+', '-', '/', '1', '3', '2', '5', '4', '7', '6', '8', '=', '@', 'C', 'B', 'F', 'I', 'H', 'O', 'N', 'S', '[', ']', '\\', 'c', 'l', 'o', 'n', 'p', 's', 'r'], max_length: Optional[int] = 100)[source]

Encodes any arbitrary string or molecule as a one-hot array.

This featurizer encodes the characters within any given string as a one-hot array. It also works with RDKit molecules: it can convert RDKit molecules to SMILES strings and then one-hot encode the characters in said strings.

Standalone Usage:

>>> import deepchem as dc
>>> featurizer = dc.feat.OneHotFeaturizer()
>>> smiles = ['CCC']
>>> encodings = featurizer.featurize(smiles)
>>> type(encodings[0])
<class 'numpy.ndarray'>
>>> encodings[0].shape
(100, 35)
>>> featurizer.untransform(encodings[0])
'CCC'

Note

This class needs RDKit to be installed in order to accept RDKit molecules as inputs.

It does not need RDKit to be installed to work with arbitrary strings.

__init__(charset: List[str] = ['#', ')', '(', '+', '-', '/', '1', '3', '2', '5', '4', '7', '6', '8', '=', '@', 'C', 'B', 'F', 'I', 'H', 'O', 'N', 'S', '[', ']', '\\', 'c', 'l', 'o', 'n', 'p', 's', 'r'], max_length: Optional[int] = 100)[source]

Initialize featurizer.

Parameters
  • charset (List[str] (default ZINC_CHARSET)) – A list of strings, where each string is length 1 and unique.

  • max_length (Optional[int], optional (default 100)) –

    The max length for string. If the length of string is shorter than max_length, the string is padded using space.

    If max_length is None, no padding is performed and arbitrary length strings are allowed.
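The encoding and padding behaviour can be sketched in plain Python. The extra final column, used here for padding spaces and any character outside the charset, is an assumption of this sketch. The helper names and the three-character charset are hypothetical.

```python
def one_hot_encode_string(string, charset, max_length=None):
    """One-hot encode a string, optionally padding it with spaces (sketch)."""
    if max_length is not None:
        if len(string) > max_length:
            raise ValueError("string is longer than max_length")
        string = string.ljust(max_length)  # pad with spaces on the right
    width = len(charset) + 1  # assumption: last column catches everything else
    rows = []
    for char in string:
        row = [0] * width
        index = charset.index(char) if char in charset else len(charset)
        row[index] = 1
        rows.append(row)
    return rows


def decode_one_hot(rows, charset):
    """Invert one_hot_encode_string, dropping padding and unknown characters."""
    chars = []
    for row in rows:
        index = row.index(1)
        if index < len(charset):
            chars.append(charset[index])
    return "".join(chars)


toy_charset = ["C", "O", "N"]
encoded = one_hot_encode_string("CO", toy_charset, max_length=4)
decoded = decode_one_hot(encoded, toy_charset)
print(encoded, decoded)
```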

featurize(datapoints: Iterable[Any], log_every_n: int = 1000)numpy.ndarray[source]

Featurize strings or mols.

Parameters
  • datapoints (list) – A list of either strings (str or numpy.str_) or RDKit molecules.

  • log_every_n (int, optional (default 1000)) – How many elements are featurized every time a featurization is logged.

pad_smile(smiles: str)str[source]

Pad SMILES string to self.pad_length

Parameters

smiles (str) – The SMILES string to be padded.

Returns

SMILES string space padded to self.pad_length

Return type

str

pad_string(string: str)str[source]

Pad string to self.pad_length

Parameters

string (str) – The string to be padded.

Returns

String space padded to self.pad_length

Return type

str

untransform(one_hot_vectors: numpy.ndarray)str[source]

Convert from one hot representation back to original string

Parameters

one_hot_vectors (np.ndarray) – An array of one hot encoded features.

Returns

Original string for a one-hot encoded array.

Return type

str

RawFeaturizer

class RawFeaturizer(smiles: bool = False)[source]

Encodes a molecule as a SMILES string or RDKit mol.

This featurizer can be useful when you’re trying to convert a large collection of RDKit mol objects into SMILES strings, or alternatively as a “no-op” featurizer in your molecular pipeline.

Note

This class requires RDKit to be installed.

__init__(smiles: bool = False)[source]

Initialize this featurizer.

Parameters

smiles (bool, optional (default False)) – If True, encode this molecule as a SMILES string. Otherwise, encode it as an RDKit mol.

featurize(molecules, log_every_n=1000)numpy.ndarray[source]

Calculate features for molecules.

Parameters
  • molecules (rdkit.Chem.rdchem.Mol / SMILES string / iterable) – RDKit Mol, or SMILES string or iterable sequence of RDKit mols/SMILES strings.

  • log_every_n (int, default 1000) – Logging messages reported every log_every_n samples.

Returns

features – A numpy array containing a featurized representation of datapoints.

Return type

np.ndarray

Molecular Complex Featurizers

These featurizers work with three dimensional molecular complexes.

RdkitGridFeaturizer

class RdkitGridFeaturizer(nb_rotations=0, feature_types=None, ecfp_degree=2, ecfp_power=3, splif_power=3, box_width=16.0, voxel_width=1.0, flatten=False, verbose=True, sanitize=False, **kwargs)[source]

Featurizes protein-ligand complex using flat features or a 3D grid (in which each voxel is described with a vector of features).

__init__(nb_rotations=0, feature_types=None, ecfp_degree=2, ecfp_power=3, splif_power=3, box_width=16.0, voxel_width=1.0, flatten=False, verbose=True, sanitize=False, **kwargs)[source]
Parameters
  • nb_rotations (int, optional (default 0)) – Number of additional random rotations of a complex to generate.

  • feature_types (list, optional (default ['ecfp'])) –

    Types of features to calculate. Available types are

    flat features -> ‘ecfp_ligand’, ‘ecfp_hashed’, ‘splif_hashed’, ‘hbond_count’

    voxel features -> ‘ecfp’, ‘splif’, ‘sybyl’, ‘salt_bridge’, ‘charge’, ‘hbond’, ‘pi_stack’, ‘cation_pi’

    There are also 3 predefined sets of features

    ‘flat_combined’, ‘voxel_combined’, and ‘all_combined’.

    Calculated features are concatenated and their order is preserved (features in predefined sets are in alphabetical order).

  • ecfp_degree (int, optional (default 2)) – ECFP radius.

  • ecfp_power (int, optional (default 3)) – Number of bits to store ECFP features (resulting vector will be 2^ecfp_power long)

  • splif_power (int, optional (default 3)) – Number of bits to store SPLIF features (resulting vector will be 2^splif_power long)

  • box_width (float, optional (default 16.0)) – Size of a box in which voxel features are calculated. Box is centered on a ligand centroid.

  • voxel_width (float, optional (default 1.0)) – Size of a 3D voxel in a grid.

  • flatten (bool, optional (default False)) – Indicate whether calculated features should be flattened. Output is always flattened if flat features are specified in feature_types.

  • verbose (bool, optional (default True)) – Verbosity for logging.

  • sanitize (bool, optional (default False)) – If set to True, molecules will be sanitized. Note that calculating some features (e.g. aromatic interactions) requires sanitized molecules.

  • **kwargs (dict, optional) – Keyword arguments can be used to specify custom cutoffs and bins. The default cutoffs and bins are:

    hbond_dist_bins: [(2.2, 2.5), (2.5, 3.2), (3.2, 4.0)]

    hbond_angle_cutoffs: [5, 50, 90]

    splif_contact_bins: [(0, 2.0), (2.0, 3.0), (3.0, 4.5)]

    ecfp_cutoff: 4.5

    sybyl_cutoff: 7.0

    salt_bridges_cutoff: 5.0

    pi_stack_dist_cutoff: 4.4

    pi_stack_angle_cutoff: 30.0

    cation_pi_dist_cutoff: 6.5

    cation_pi_angle_cutoff: 30.0
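To make box_width and voxel_width concrete, the sketch below maps a 3D coordinate to a voxel index inside a cubic box centered on the ligand centroid. `voxel_index` is a hypothetical helper. Returning None for atoms outside the box is an assumption of this sketch.

```python
import math


def voxel_index(coord, centroid, box_width=16.0, voxel_width=1.0):
    """Return the (i, j, k) voxel containing coord, or None if outside the box."""
    voxels_per_edge = int(box_width / voxel_width)
    index = tuple(
        # Shift so the box spans [0, box_width) along each axis, then bin.
        math.floor((value - center + box_width / 2) / voxel_width)
        for value, center in zip(coord, centroid)
    )
    if all(0 <= i < voxels_per_edge for i in index):
        return index
    return None


centroid = (0.0, 0.0, 0.0)  # hypothetical ligand centroid
print(voxel_index((0.0, 0.0, 0.0), centroid))     # voxel at the box center
print(voxel_index((-8.0, -8.0, -8.0), centroid))  # corner voxel
print(voxel_index((8.0, 0.0, 0.0), centroid))     # just outside the box
```

With the default values the grid has 16 voxels along each edge, so a voxel feature with a per-voxel vector of length 2^ecfp_power = 8 would occupy a 16 x 16 x 16 x 8 array before any flattening.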

featurize(complexes: Iterable[Tuple[str, str]], log_every_n: int = 100)numpy.ndarray[source]

Calculate features for mol/protein complexes.

Parameters

complexes (Iterable[Tuple[str, str]]) – List of filenames (PDB, SDF, etc.) for ligand molecules and proteins. Each element should be a tuple of the form (ligand_filename, protein_filename).

Returns

features – Array of features

Return type

np.ndarray

AtomicConvFeaturizer

class AtomicConvFeaturizer(frag1_num_atoms, frag2_num_atoms, complex_num_atoms, max_num_neighbors, neighbor_cutoff, strip_hydrogens=True)[source]

This class computes the featurization that corresponds to AtomicConvModel.

This class computes featurizations needed for AtomicConvModel. Given two molecular structures, it computes a number of useful geometric features. In particular, for each molecule and the global complex, it computes a coordinates matrix of size (N_atoms, 3) where N_atoms is the number of atoms. It also computes a neighbor-list, a dictionary with N_atoms elements where neighbor-list[i] is a list of the atoms the i-th atom has as neighbors. In addition, it computes a z-matrix for the molecule, which is an array of shape (N_atoms,) that contains the atomic number of each atom.

Since the featurization computes these three quantities for each of the two molecules and the complex, a total of 9 quantities are returned for each complex. Note that for efficiency, fragments of the molecules can be provided rather than the full molecules themselves.

__init__(frag1_num_atoms, frag2_num_atoms, complex_num_atoms, max_num_neighbors, neighbor_cutoff, strip_hydrogens=True)[source]
Parameters
  • frag1_num_atoms (int) – Maximum number of atoms in fragment 1.

  • frag2_num_atoms (int) – Maximum number of atoms in fragment 2.

  • complex_num_atoms (int) – Maximum number of atoms in complex of frag1/frag2 together.

  • max_num_neighbors (int) – Maximum number of atoms considered as neighbors.

  • neighbor_cutoff (float) – Maximum distance (angstroms) for two atoms to be considered as neighbors. If more than max_num_neighbors atoms fall within this cutoff, the closest max_num_neighbors will be used.

  • strip_hydrogens (bool (default True)) – Remove hydrogens before computing featurization.
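How neighbor_cutoff and max_num_neighbors interact can be sketched with a brute-force neighbor list. This is an illustrative O(N^2) version with hypothetical coordinates. The use of a strict "less than" comparison against the cutoff is an assumption, and DeepChem's implementation may differ.

```python
import math


def neighbor_list(coords, neighbor_cutoff, max_num_neighbors):
    """Map each atom index to its closest neighbors within the cutoff (sketch)."""
    neighbors = {}
    for i, coord_i in enumerate(coords):
        within = []
        for j, coord_j in enumerate(coords):
            if i == j:
                continue
            distance = math.dist(coord_i, coord_j)
            if distance < neighbor_cutoff:
                within.append((distance, j))
        # Keep only the closest max_num_neighbors atoms.
        within.sort()
        neighbors[i] = [j for _, j in within[:max_num_neighbors]]
    return neighbors


# Hypothetical coordinates in angstroms: three atoms in a row and one far away.
toy_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
print(neighbor_list(toy_coords, neighbor_cutoff=2.5, max_num_neighbors=1))
print(neighbor_list(toy_coords, neighbor_cutoff=2.5, max_num_neighbors=2))
```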

featurize(complexes: Iterable[Tuple[str, str]], log_every_n: int = 100)numpy.ndarray[source]

Calculate features for mol/protein complexes.

Parameters

complexes (Iterable[Tuple[str, str]]) – List of filenames (PDB, SDF, etc.) for ligand molecules and proteins. Each element should be a tuple of the form (ligand_filename, protein_filename).

Returns

features – Array of features

Return type

np.ndarray

Inorganic Crystal Featurizers

These featurizers work with datasets of inorganic crystals.

MaterialCompositionFeaturizer

Material Composition Featurizers are those that work with datasets of crystal compositions with periodic boundary conditions. For inorganic crystal structures, these featurizers operate on chemical compositions (e.g. “MoS2”). They should be applied on systems that have periodic boundary conditions. Composition featurizers are not designed to work with molecules.

ElementPropertyFingerprint

class ElementPropertyFingerprint(data_source: str = 'matminer')[source]

Fingerprint of elemental properties from composition.

Based on the data source chosen, returns properties and statistics (min, max, range, mean, standard deviation, mode) for a compound based on elemental stoichiometry. E.g., the average electronegativity of atoms in a crystal structure. The chemical fingerprint is a vector of these statistics. For a full list of properties and statistics, see matminer.featurizers.composition.ElementProperty(data_source).feature_labels().

This featurizer requires the optional dependencies pymatgen and matminer. It may be useful when only crystal compositions are available (and not 3D coordinates).

See references [1]_, [2]_, [3]_, and [4]_ for more details.
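To illustrate what such statistics look like, the sketch below computes a few of them for the Pauling electronegativity of Fe2O3 by hand. The property table is limited to the two elements needed. Weighting the mean by atomic fraction is an assumption about how the statistic is defined, and `property_statistics` is a hypothetical helper, not a matminer or DeepChem function.

```python
# Pauling electronegativities for the two elements used in this example.
ELECTRONEGATIVITY = {"Fe": 1.83, "O": 3.44}


def property_statistics(amounts, table):
    """Compute simple composition statistics for one elemental property."""
    total = sum(amounts.values())
    values = [table[element] for element in amounts]
    # Assumption: the mean is weighted by the atomic fraction of each element.
    mean = sum(amounts[element] / total * table[element] for element in amounts)
    return {
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "mean": mean,
    }


stats = property_statistics({"Fe": 2, "O": 3}, ELECTRONEGATIVITY)
print(stats)
```

A full fingerprint concatenates statistics like these over many elemental properties, which is how a single composition becomes a fixed-length vector.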

References

1

MagPie data: Ward, L. et al. npj Comput Mater 2, 16028 (2016). https://doi.org/10.1038/npjcompumats.2016.28

2

Deml data: Deml, A. et al. Physical Review B 93, 085142 (2016). https://doi.org/10.1103/PhysRevB.93.085142

3

Matminer: Ward, L. et al. Comput. Mater. Sci. 152, 60-69 (2018).

4

Pymatgen: Ong, S.P. et al. Comput. Mater. Sci. 68, 314-319 (2013).

Examples

>>> import deepchem as dc
>>> from pymatgen.core import Composition
>>> comp = Composition("Fe2O3")
>>> featurizer = dc.feat.ElementPropertyFingerprint()
>>> features = featurizer.featurize([comp])
>>> type(features[0])
<class 'numpy.ndarray'>
>>> features[0].shape
(65,)

Note

This class requires matminer and Pymatgen to be installed. NaN feature values are automatically converted to 0 by this featurizer.

__init__(data_source: str = 'matminer')[source]
Parameters

data_source (str of "matminer", "magpie" or "deml" (default "matminer")) – Source for element property data.

featurize(compositions: Iterable[str], log_every_n: int = 1000) → numpy.ndarray[source]

Calculate features for crystal compositions.

Parameters
  • compositions (Iterable[str]) – Iterable sequence of composition strings, e.g. “MoS2”.

  • log_every_n (int, default 1000) – Logging messages reported every log_every_n samples.

Returns

features – A numpy array containing a featurized representation of compositions.

Return type

np.ndarray

ElemNetFeaturizer

class ElemNetFeaturizer[source]

Fixed size vector of length 86 containing raw fractional elemental compositions in the compound. The 86 chosen elements are based on the original implementation at https://github.com/NU-CUCIS/ElemNet.

Returns a vector containing fractional compositions of each element in the compound.
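The fractional composition vector can be sketched in a few lines. This is only an illustration of the idea, not the featurizer's code: fractional_vector is a hypothetical helper, the element list is shortened to seven entries instead of the 86 used by ElemNet, and only simple formulas without parentheses are parsed.

```python
import re

# Hypothetical, shortened element list; the real featurizer uses 86 elements.
ELEMENTS = ["H", "C", "N", "O", "S", "Fe", "Mo"]


def fractional_vector(formula, elements=ELEMENTS):
    """Return the atomic fraction of each listed element in a simple formula."""
    amounts = {}
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        amounts[symbol] = amounts.get(symbol, 0.0) + (float(count) if count else 1.0)
    total = sum(amounts.values())
    # Elements absent from the formula get a fraction of zero.
    return [amounts.get(el, 0.0) / total for el in elements]


vec = fractional_vector("Fe2O3")
print(vec)
```

Each position in the vector always refers to the same element, so vectors from different compounds can be compared directly.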

References

1

Jha, D., Ward, L., Paul, A. et al. Sci Rep 8, 17593 (2018). https://doi.org/10.1038/s41598-018-35934-y

Examples

>>> import deepchem as dc
>>> comp = "Fe2O3"
>>> featurizer = dc.feat.ElemNetFeaturizer()
>>> features = featurizer.featurize([comp])
>>> type(features[0])
<class 'numpy.ndarray'>
>>> features[0].shape
(86,)
>>> round(sum(features[0]))
1

Note

This class requires Pymatgen to be installed.

get_vector(comp: DefaultDict) → Optional[numpy.ndarray][source]

Converts a dictionary containing element names and corresponding compositional fractions into a vector of fractions.

Parameters

comp (collections.defaultdict object) – Dictionary mapping element names to fractional compositions.

Returns

fractions – Vector of fractional compositions of each element.

Return type

np.ndarray

MaterialStructureFeaturizer

Material Structure Featurizers are those that work with datasets of crystals with periodic boundary conditions. For inorganic crystal structures, these featurizers operate on pymatgen.Structure objects, which include a lattice and 3D coordinates that specify a periodic crystal structure. They should be applied on systems that have periodic boundary conditions. Structure featurizers are not designed to work with molecules.

SineCoulombMatrix

class SineCoulombMatrix(max_atoms: int = 100, flatten: bool = True)[source]

Calculate sine Coulomb matrix for crystals.

A variant of Coulomb matrix for periodic crystals.

The sine Coulomb matrix is identical to the Coulomb matrix, except that the inverse distance function is replaced by the inverse of sin**2 of the vector between sites which are periodic in the dimensions of the crystal lattice.

Features are flattened into a vector of matrix eigenvalues by default for ML-readiness. To ensure that all feature vectors are equal length, the maximum number of atoms (eigenvalues) in the input dataset must be specified.
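The following sketch shows why max_atoms is needed when features are flattened: crystals with different numbers of atoms give different numbers of eigenvalues, so shorter vectors are padded to a common length. The helper names are hypothetical, the matrix entries are made up, and zero-padding on the right is an assumption about the padding scheme rather than a statement about the library code.

```python
import math


def eigenvalues_2x2(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], largest first."""
    center = (a + d) / 2.0
    spread = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    return [center + spread, center - spread]


def pad_to_length(values, max_atoms):
    """Right-pad with zeros so every crystal yields max_atoms entries."""
    if len(values) > max_atoms:
        raise ValueError("structure has more atoms than max_atoms")
    return values + [0.0] * (max_atoms - len(values))


# Made-up matrix entries for a two-atom cell, for illustration only.
eigs = eigenvalues_2x2(a=4.0, b=1.0, d=4.0)
features = pad_to_length(eigs, max_atoms=5)
print(features)  # [5.0, 3.0, 0.0, 0.0, 0.0]
```

Choosing max_atoms as the largest atom count in the dataset keeps every feature vector the same length without discarding eigenvalues.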

This featurizer requires the optional dependencies pymatgen and matminer. It may be useful when crystal structures with 3D coordinates are available.

See [1] for more details.

References

1

Faber et al. “Crystal Structure Representations for Machine Learning Models of Formation Energies”, Inter. J. Quantum Chem. 115, 16, 2015. https://arxiv.org/abs/1503.07406

Examples

>>> import deepchem as dc
>>> from pymatgen.core import Lattice, Structure
>>> lattice = Lattice.cubic(4.2)
>>> structure = Structure(lattice, ["Cs", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]])
>>> featurizer = dc.feat.SineCoulombMatrix(max_atoms=2)
>>> features = featurizer.featurize([structure])
>>> type(features[0])
<class 'numpy.ndarray'>
>>> features[0].shape # (max_atoms,)
(2,)

Note

This class requires matminer and Pymatgen to be installed.

__init__(max_atoms: int = 100, flatten: bool = True)[source]
Parameters
  • max_atoms (int (default 100)) – Maximum number of atoms for any crystal in the dataset. Used to pad the Coulomb matrix.

  • flatten (bool (default True)) – Return flattened vector of matrix eigenvalues.

featurize(structures: Iterable[Union[Dict[str, Any], Any]], log_every_n: int = 1000) → numpy.ndarray[source]

Calculate features for crystal structures.

Parameters
  • structures (Iterable[Union[Dict, pymatgen.core.Structure]]) – Iterable sequence of pymatgen structure dictionaries or pymatgen.core.Structure. Please confirm the dictionary representations of pymatgen.core.Structure from https://pymatgen.org/pymatgen.core.structure.html.

  • log_every_n (int, default 1000) – Logging messages reported every log_every_n samples.

Returns

features – A numpy array containing a featurized representation of structures.

Return type

np.ndarray

CGCNNFeaturizer

class CGCNNFeaturizer(radius: float = 8.0, max_neighbors: float = 12, step: float = 0.2)[source]

Calculate structure graph features for crystals.

Based on the implementation in Crystal Graph Convolutional Neural Networks (CGCNN). The method constructs a crystal graph representation including atom features and bond features (neighbor distances). Neighbors are determined by searching in a sphere around atoms in the unit cell. A Gaussian filter is applied to neighbor distances. All units are in angstrom.

This featurizer requires the optional dependency pymatgen. It may be useful when 3D coordinates are available and when using graph network models and crystal graph convolutional networks.
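The Gaussian filter on neighbor distances can be sketched as an expansion of one distance onto a grid of Gaussian basis functions. This is a sketch under stated assumptions, modeled on the CGCNN reference implementation: the grid is assumed to run from 0 to radius in increments of step, the Gaussian width is assumed to equal step, and gaussian_expand is a hypothetical helper name.

```python
import math


def gaussian_expand(distance, radius=8.0, step=0.2):
    """Expand one neighbor distance (in angstrom) on a grid of Gaussians."""
    # Assumed grid: centers at 0, step, 2 * step, ..., radius.
    n_centers = int(round(radius / step)) + 1
    centers = [i * step for i in range(n_centers)]
    # Assumed width: the Gaussian standard-deviation-like scale equals step.
    return [math.exp(-((distance - c) ** 2) / step ** 2) for c in centers]


edge_feature = gaussian_expand(2.0)
print(len(edge_feature))  # 41
```

With the default radius and step, each bond is described by 41 numbers that peak at the grid center closest to the actual distance, which gives the network a smooth encoding of distance.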

See [1] for more details.

References

1

T. Xie and J. C. Grossman, “Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties”, Phys. Rev. Lett. 120, 2018, https://arxiv.org/abs/1710.10324

Examples

>>> import deepchem as dc
>>> from pymatgen.core import Lattice, Structure
>>> featurizer = dc.feat.CGCNNFeaturizer()
>>> lattice = Lattice.cubic(4.2)
>>> structure = Structure(lattice, ["Cs", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]])
>>> features = featurizer.featurize([structure])
>>> feature = features[0]
>>> print(type(feature))
<class 'deepchem.feat.graph_data.GraphData'>

Note

This class requires Pymatgen to be installed.

__init__(radius: float = 8.0, max_neighbors: float = 12, step: float = 0.2)[source]
Parameters
  • radius (float (default 8.0)) – Radius of sphere for finding neighbors of atoms in unit cell.

  • max_neighbors (int (default 12)) – Maximum number of neighbors to consider when constructing graph.

  • step (float (default 0.2)) – Step size for Gaussian filter. This value is used when building edge features.

featurize(structures: Iterable[Union[Dict[str, Any], Any]], log_every_n: int = 1000) → numpy.ndarray[source]

Calculate features for crystal structures.

Parameters
  • structures (Iterable[Union[Dict, pymatgen.core.Structure]]) – Iterable sequence of pymatgen structure dictionaries or pymatgen.core.Structure. Please confirm the dictionary representations of pymatgen.core.Structure from https://pymatgen.org/pymatgen.core.structure.html.

  • log_every_n (int, default 1000) – Logging messages reported every log_every_n samples.

Returns

features – A numpy array containing a featurized representation of structures.

Return type

np.ndarray

LCNNFeaturizer

class LCNNFeaturizer(structure: Any, aos: List[str], pbc: List[bool], ns: int = 1, na: int = 1, cutoff: float = 6.0)[source]

Calculates 2-D surface graph features in 6 different permutations.

Based on the implementation of the Lattice Graph Convolution Neural Network (LCNN). This method produces atom-wise features (one-hot encoding) and adjacent neighbors in the specified order of permutations. Neighbors are determined by first extracting a site's local environment from the primitive cell and then performing graph matching and distance matching to find neighbors. First, the template of the primitive cell needs to be defined, along with the periodic boundary conditions and the active and spectator site details. A structure (a data point, i.e. one configuration of adsorbate atoms) is then passed for featurization.

This featurization produces a regular graph (equal number of neighbors) along with its permutations along 6 symmetric axes. The transformation can be applied when the ordering of neighboring nodes around a site plays an important role in the property prediction. Because it considers the local neighbor environment, the current implementation is useful for finding neighbors when calculating formation energies in adsorption tasks. Adsorption is important in many applications, such as catalyst and semiconductor design.

The permuted neighbors are calculated using the primitive cell, i.e. the periodic cells in all the data points are built via lattice transformations of the primitive cell.

Primitive cell format:

  1. Pymatgen structure object with the site_properties key

  • “SiteTypes”, stating whether each site is an active site “A1” or a spectator site “S1”.

  2. ns, the number of spectator element types. For “S1” it is 1.

  3. na, the number of active element types. For “A1” it is 1.

  4. aos, the different species of active elements “A1”.

  5. pbc, the periodic boundary conditions.

Data point structure format (configuration of atoms):

  1. Pymatgen structure object with site_properties containing the following keys:

  • “SiteTypes”, stating whether each site is an active site “A1” or a spectator site “S1”.

  • “oss”, the different occupational sites. For spectator sites set it to -1.

It is highly recommended that the cells of the data points are defined directly from the primitive cell; specifically, the relative coordinates between sites should be consistent so that the lattice does not deviate.

References

1

Jonathan Lym and Geun Ho Gu, J. Phys. Chem. C 2019, 123, 18951−18959

Examples

>>> import deepchem as dc
>>> from pymatgen.core import Structure
>>> import numpy as np
>>> PRIMITIVE_CELL = {
...   "lattice": [[2.818528, 0.0, 0.0],
...               [-1.409264, 2.440917, 0.0],
...               [0.0, 0.0, 25.508255]],
...   "coords": [[0.66667, 0.33333, 0.090221],
...              [0.33333, 0.66667, 0.18043936],
...              [0.0, 0.0, 0.27065772],
...              [0.66667, 0.33333, 0.36087608],
...              [0.33333, 0.66667, 0.45109444],
...              [0.0, 0.0, 0.49656991]],
...   "species": ['H', 'H', 'H', 'H', 'H', 'He'],
...   "site_properties": {'SiteTypes': ['S1', 'S1', 'S1', 'S1', 'S1', 'A1']}
... }
>>> PRIMITIVE_CELL_INF0 = {
...    "cutoff": np.around(6.00),
...    "structure": Structure(**PRIMITIVE_CELL),
...    "aos": ['1', '0', '2'],
...    "pbc": [True, True, False],
...    "ns": 1,
...    "na": 1
... }
>>> DATA_POINT = {
...   "lattice": [[1.409264, -2.440917, 0.0],
...               [4.227792, 2.440917, 0.0],
...               [0.0, 0.0, 23.17559]],
...   "coords": [[0.0, 0.0, 0.099299],
...              [0.0, 0.33333, 0.198598],
...              [0.5, 0.16667, 0.297897],
...              [0.0, 0.0, 0.397196],
...              [0.0, 0.33333, 0.496495],
...              [0.5, 0.5, 0.099299],
...              [0.5, 0.83333, 0.198598],
...              [0.0, 0.66667, 0.297897],
...              [0.5, 0.5, 0.397196],
...              [0.5, 0.83333, 0.496495],
...              [0.0, 0.66667, 0.54654766],
...              [0.5, 0.16667, 0.54654766]],
...   "species": ['H', 'H', 'H', 'H', 'H', 'H',
...               'H', 'H', 'H', 'H', 'He', 'He'],
...   "site_properties": {
...     "SiteTypes": ['S1', 'S1', 'S1', 'S1', 'S1',
...                   'S1', 'S1', 'S1', 'S1', 'S1',
...                   'A1', 'A1'],
...     "oss": ['-1', '-1', '-1', '-1', '-1', '-1',
...             '-1', '-1', '-1', '-1', '0', '2']
...                   }
... }
>>> featuriser = dc.feat.LCNNFeaturizer(**PRIMITIVE_CELL_INF0)
>>> print(type(featuriser._featurize(Structure(**DATA_POINT))))
<class 'deepchem.feat.graph_data.GraphData'>

Note

This class requires pymatgen, networkx and scipy to be installed.

__init__(structure: Any, aos: List[str], pbc: List[bool], ns: int = 1, na: int = 1, cutoff: float = 6.0)[source]
Parameters
  • structure (pymatgen.core.Structure) – Pymatgen Structure object of the primitive cell used for calculating neighbors from lattice transformations. It also requires a site_properties attribute with “SiteTypes” (active or spectator site).

  • aos (List[str]) – A list of all the active site species. For the Pt, N, NO configuration set it to [‘0’, ‘1’, ‘2’].

  • pbc (List[bool]) – Periodic boundary conditions.

  • ns (int (default 1)) – The number of spectator element types. For “S1” it is 1.

  • na (int (default 1)) – The number of active element types. For “A1” it is 1.

  • cutoff (float (default 6.00)) – Cutoff radius for getting the local environment. Only used down to 2 digits.

featurize(structures: Iterable[Union[Dict[str, Any], Any]], log_every_n: int = 1000) → numpy.ndarray[source]

Calculate features for crystal structures.

Parameters
  • structures (Iterable[Union[Dict, pymatgen.core.Structure]]) – Iterable sequence of pymatgen structure dictionaries or pymatgen.core.Structure. Please confirm the dictionary representations of pymatgen.core.Structure from https://pymatgen.org/pymatgen.core.structure.html.

  • log_every_n (int, default 1000) – Logging messages reported every log_every_n samples.

Returns

features – A numpy array containing a featurized representation of structures.

Return type

np.ndarray

Molecule Tokenizers

A tokenizer is in charge of preparing the inputs for a natural language processing model. For many scientific applications, it is possible to treat inputs as “words” or “sentences” and use NLP methods to make meaningful predictions. For example, SMILES strings or DNA sequences have grammatical structure and can be usefully modeled with NLP techniques. DeepChem provides some scientifically relevant tokenizers for use in different applications. These tokenizers are based on those from the Huggingface transformers library (which DeepChem tokenizers inherit from).

The base classes PreTrainedTokenizer and PreTrainedTokenizerFast implement the common methods for encoding string inputs into model inputs and for instantiating/saving Python tokenizers, either from a local file or directory or from a pretrained tokenizer provided by the library (downloaded from HuggingFace’s AWS S3 repository).

PreTrainedTokenizer (transformers.PreTrainedTokenizer) thus implements the main methods for using all the tokenizers:

  • Tokenizing (splitting strings into sub-word token strings), converting token strings to ids and back, and encoding/decoding (i.e. tokenizing and converting to integers)

  • Adding new tokens to the vocabulary in a way that is independent of the underlying structure (BPE, SentencePiece…)

  • Managing special tokens such as mask and beginning-of-sentence tokens (adding them, assigning them to attributes in the tokenizer for easy access, and making sure they are not split during tokenization)
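The responsibilities listed above can be illustrated with a toy tokenizer. This is a hedged, stand-alone sketch and not the transformers implementation: the vocabulary is invented, tokenization is simply one character per token, and the functions only borrow the names of the corresponding tokenizer methods.

```python
# Hypothetical toy vocabulary; real vocabularies are loaded from a file.
VOCAB = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "C", "O", "=", "(", ")"]
SPECIAL_TOKENS = {"[PAD]", "[UNK]", "[CLS]", "[SEP]"}
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}


def convert_tokens_to_ids(tokens):
    """Map token strings to integer ids, using [UNK] for unknown tokens."""
    unk = TOKEN_TO_ID["[UNK]"]
    return [TOKEN_TO_ID.get(t, unk) for t in tokens]


def convert_ids_to_tokens(ids):
    """Map integer ids back to token strings."""
    return [VOCAB[i] for i in ids]


def encode(text):
    """Tokenize (one character per token here) and convert to ids."""
    tokens = ["[CLS]"] + list(text) + ["[SEP]"]
    return convert_tokens_to_ids(tokens)


def decode(ids, skip_special_tokens=True):
    """Convert ids back to a string, optionally dropping special tokens."""
    tokens = convert_ids_to_tokens(ids)
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL_TOKENS]
    return "".join(tokens)


ids = encode("CC(=O)O")
print(ids)          # [2, 4, 4, 7, 6, 5, 8, 5, 3]
print(decode(ids))  # CC(=O)O
```

Special tokens are kept as whole vocabulary entries, so they are never split during tokenization and can be stripped again on decoding.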

BatchEncoding holds the output of the tokenizer’s encoding methods (__call__, encode_plus and batch_encode_plus) and is derived from a Python dictionary. When the tokenizer is a pure Python tokenizer, this class behaves just like a standard Python dictionary and holds the various model inputs computed by these methods (input_ids, attention_mask…). For more details on the base tokenizers which the DeepChem tokenizers inherit from, please refer to the HuggingFace tokenizers docs.

Tokenization methods on string-based corpora in the life sciences are becoming increasingly popular for NLP-based applications to chemistry and biology. One such example is ChemBERTa, a transformer for molecular property prediction. DeepChem offers a tutorial on using ChemBERTa with an alternate tokenizer, a Byte-Pair Encoder (BPE).

SmilesTokenizer

The dc.feat.SmilesTokenizer class inherits from the BertTokenizer class in transformers. It runs a WordPiece tokenization algorithm over SMILES strings using the SMILES tokenization regex developed by Schwaller et al.

The SmilesTokenizer employs an atom-wise tokenization strategy using the following regular expression:

SMI_REGEX_PATTERN = r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"
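As a quick illustration of what atom-wise tokenization means, the sketch below applies the pattern with Python's re module. The pattern is the default regex_pattern listed for BasicSmilesTokenizer on this page, written as a raw string; tokenize_smiles is a hypothetical helper, not part of DeepChem.

```python
import re

# Pattern taken from the BasicSmilesTokenizer default on this page.
SMI_REGEX_PATTERN = (
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into atom-level tokens."""
    return re.findall(SMI_REGEX_PATTERN, smiles)


# Bracket atoms stay together as one token.
alanine = tokenize_smiles("C[C@H](N)C(=O)O")
print(alanine)

# Two-letter halogens are matched before single letters.
halide = tokenize_smiles("ClCCBr")
print(halide)  # ['Cl', 'C', 'C', 'Br']
```

Because bracket atoms and two-letter elements are tried first, the tokens line up with chemical atoms rather than with single characters, and joining the tokens gives back the original string.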

To use, please install the transformers package using the following pip command:

pip install transformers

class SmilesTokenizer(vocab_file: str = '', **kwargs)[source]

Creates the SmilesTokenizer class. The tokenizer inherits heavily from the BertTokenizer implementation found in Huggingface’s transformers library. It runs a WordPiece tokenization algorithm over SMILES strings using the SMILES tokenization regex developed by Schwaller et al.

Please see https://github.com/huggingface/transformers and https://github.com/rxn4chemistry/rxnfp for more details.

Examples

>>> import os
>>> from deepchem.feat.smiles_tokenizer import SmilesTokenizer
>>> current_dir = os.path.dirname(os.path.realpath(__file__))
>>> vocab_path = os.path.join(current_dir, 'tests/data', 'vocab.txt')
>>> tokenizer = SmilesTokenizer(vocab_path)
>>> print(tokenizer.encode("CC(=O)OC1=CC=CC=C1C(=O)O"))
[12, 16, 16, 17, 22, 19, 18, 19, 16, 20, 22, 16, 16, 22, 16, 16, 22, 16, 20, 16, 17, 22, 19, 18, 19, 13]

References

1

Schwaller, Philippe; Probst, Daniel; Vaucher, Alain C.; Nair, Vishnu H; Kreutter, David; Laino, Teodoro; et al. (2019): Mapping the Space of Chemical Reactions using Attention-Based Neural Networks. ChemRxiv. Preprint. https://doi.org/10.26434/chemrxiv.9897365.v3

Note

This class requires huggingface’s transformers and tokenizers libraries to be installed.

__init__(vocab_file: str = '', **kwargs)[source]

Constructs a SmilesTokenizer.

Parameters

vocab_file (str) – Path to a SMILES character per line vocabulary file. Default vocab file is found in deepchem/feat/tests/data/vocab.txt

property vocab_size[source]

Size of the base vocabulary (without the added tokens).

Type

int

convert_tokens_to_string(tokens: List[str])[source]

Converts a sequence of tokens (strings) into a single string.

Parameters

tokens (List[str]) – List of tokens for a given string sequence.

Returns

out_string – Single string from combined tokens.

Return type

str

add_special_tokens_ids_single_sequence(token_ids: List[int])[source]

Adds special tokens to a sequence for sequence classification tasks.

A BERT sequence has the following format: [CLS] X [SEP]

Parameters

token_ids (list[int]) – list of tokenized input ids. Can be obtained using the encode or encode_plus methods.

add_special_tokens_single_sequence(tokens: List[str])[source]

Adds special tokens to a sequence for sequence classification tasks. A BERT sequence has the following format: [CLS] X [SEP]

Parameters

tokens (List[str]) – List of tokens for a given string sequence.

add_special_tokens_ids_sequence_pair(token_ids_0: List[int], token_ids_1: List[int]) → List[int][source]

Adds special tokens to a sequence pair for sequence classification tasks. A BERT sequence pair has the following format: [CLS] A [SEP] B [SEP]

Parameters
  • token_ids_0 (List[int]) – List of ids for the first string sequence in the sequence pair (A).

  • token_ids_1 (List[int]) – List of ids for the second string sequence in the sequence pair (B).
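The two BERT formats, [CLS] X [SEP] and [CLS] A [SEP] B [SEP], can be sketched on plain lists of ids. The functions below are hypothetical stand-ins rather than the tokenizer methods themselves, and the ids 12 and 13 are an assumption: they are the first and last entries of the encoded example shown for this class, taken here to be [CLS] and [SEP].

```python
CLS_ID = 12  # assumed id of [CLS], taken from the start of the example output
SEP_ID = 13  # assumed id of [SEP], taken from the end of the example output


def add_special_ids_single(token_ids):
    """Wrap one sequence as [CLS] X [SEP]."""
    return [CLS_ID] + token_ids + [SEP_ID]


def add_special_ids_pair(token_ids_0, token_ids_1):
    """Wrap a sequence pair as [CLS] A [SEP] B [SEP]."""
    return [CLS_ID] + token_ids_0 + [SEP_ID] + token_ids_1 + [SEP_ID]


single = add_special_ids_single([16, 16, 19])
pair = add_special_ids_pair([16, 16, 19], [16, 22])
print(single)  # [12, 16, 16, 19, 13]
print(pair)    # [12, 16, 16, 19, 13, 16, 22, 13]
```

The separator between A and B lets a model tell the two sequences apart, for example the reactants and products of a reaction.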

add_padding_tokens(token_ids: List[int], length: int, right: bool = True) → List[int][source]

Adds padding tokens to return a sequence of the given length. By default, padding tokens are added to the right of the sequence.

Parameters
  • token_ids (list[int]) – list of tokenized input ids. Can be obtained using the encode or encode_plus methods.

  • length (int) – Target length of the padded sequence.

  • right (bool, default True) – If True, padding tokens are added to the right of the sequence; otherwise they are added to the left.

Returns

The input ids padded with padding tokens to the requested length.

Return type

List[int]
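Padding to a fixed length can be sketched as follows. This is a minimal illustration rather than the method's source: add_padding is a hypothetical function, and the padding id of 0 is an assumption (the real id depends on the vocabulary file).

```python
PAD_ID = 0  # assumed id of the padding token


def add_padding(token_ids, length, right=True):
    """Pad token_ids with PAD_ID up to the requested length."""
    padding = [PAD_ID] * max(0, length - len(token_ids))
    # Pad on the right by default, otherwise on the left.
    return token_ids + padding if right else padding + token_ids


print(add_padding([12, 16, 19, 13], length=6))               # [12, 16, 19, 13, 0, 0]
print(add_padding([12, 16, 19, 13], length=6, right=False))  # [0, 0, 12, 16, 19, 13]
```

Padding every sequence in a batch to the same length is what allows the batch to be stacked into a single rectangular array.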

save_vocabulary(vocab_path: str)[source]

Save the tokenizer vocabulary to a file.

Parameters

vocab_path (str) – The directory in which to save the SMILES character-per-line vocabulary file. The default vocab file is found in deepchem/feat/tests/data/vocab.txt

Returns

vocab_file – Paths to the files saved: a tuple containing the path to the SMILES character-per-line vocabulary file.

Return type

Tuple

BasicSmilesTokenizer

The dc.feat.BasicSmilesTokenizer class uses a regex tokenization pattern to tokenize SMILES strings. The regex was developed by Schwaller et al. The tokenizer is intended for SMILES strings in cases where the user does not wish to rely on the transformers API.

class BasicSmilesTokenizer(regex_pattern: str = '(\\[[^\\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\\(|\\)|\\.|=|#|-|\\+|\\\\|\\/|:|~|@|\\?|>>?|\\*|\\$|\\%[0-9]{2}|[0-9])')[source]

Run basic SMILES tokenization using a regex pattern developed by Schwaller et al. Use this tokenizer when you need one that does not require the transformers library by HuggingFace.

Examples

>>> from deepchem.feat.smiles_tokenizer import BasicSmilesTokenizer
>>> tokenizer = BasicSmilesTokenizer()
>>> print(tokenizer.tokenize("CC(=O)OC1=CC=CC=C1C(=O)O"))
['C', 'C', '(', '=', 'O', ')', 'O', 'C', '1', '=', 'C', 'C', '=', 'C', 'C', '=', 'C', '1', 'C', '(', '=', 'O', ')', 'O']

References

1

Philippe Schwaller, Teodoro Laino, Théophile Gaudin, Peter Bolgar, Christopher A. Hunter, Costas Bekas, and Alpha A. Lee. Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction. ACS Central Science 2019, 5 (9), 1572-1583. DOI: 10.1021/acscentsci.9b00576

__init__(regex_pattern: str = '(\\[[^\\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\\(|\\)|\\.|=|#|-|\\+|\\\\|\\/|:|~|@|\\?|>>?|\\*|\\$|\\%[0-9]{2}|[0-9])')[source]

Constructs a BasicSmilesTokenizer.

Parameters

regex_pattern (str) – SMILES token regex

tokenize(text)[source]

Basic Tokenization of a SMILES.

Other Featurizers

BindingPocketFeaturizer

class BindingPocketFeaturizer[source]

Featurizes binding pockets with information about chemical environments.

In many applications, it’s desirable to look at binding pockets on macromolecules which may be good targets for potential ligands or other molecules to interact with. A BindingPocketFeaturizer expects to be given a macromolecule, and a list of pockets to featurize on that macromolecule. These pockets should be of the form produced by a dc.dock.BindingPocketFinder, that is as a list of dc.utils.CoordinateBox objects.

The base featurization in this class is currently very simple: it counts the number of residues of each type present in the pocket. It’s likely that you’ll want to override this implementation for more sophisticated downstream use cases. Note that this class’s implementation will only work for proteins and not for other macromolecules.
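The residue-counting idea can be sketched without any structural biology libraries. Everything here is a simplifying assumption for illustration: residues are represented by a name and a single coordinate, pockets are axis-aligned boxes given as three (min, max) pairs, the residue-type list is shortened to three entries, and the helper names are hypothetical. The real class reads a PDB file with mdtraj and works with CoordinateBox objects.

```python
from collections import Counter

# Shortened, hypothetical list of residue types.
RESIDUE_TYPES = ["ALA", "GLY", "SER"]


def in_box(point, box):
    """Check whether a 3D point lies inside an axis-aligned box."""
    (x_min, x_max), (y_min, y_max), (z_min, z_max) = box
    x, y, z = point
    return x_min <= x <= x_max and y_min <= y <= y_max and z_min <= z <= z_max


def featurize_pockets(residues, boxes):
    """Count residues of each type in each pocket.

    Returns a nested list of shape (len(boxes), len(RESIDUE_TYPES)).
    """
    features = []
    for box in boxes:
        counts = Counter(name for name, xyz in residues if in_box(xyz, box))
        features.append([counts.get(r, 0) for r in RESIDUE_TYPES])
    return features


residues = [
    ("ALA", (1.0, 1.0, 1.0)),
    ("GLY", (2.0, 2.0, 2.0)),
    ("SER", (2.5, 1.5, 0.5)),
    ("ALA", (9.0, 9.0, 9.0)),
]
boxes = [((0, 3), (0, 3), (0, 3)), ((8, 10), (8, 10), (8, 10))]
pocket_features = featurize_pockets(residues, boxes)
print(pocket_features)  # [[1, 1, 1], [1, 0, 0]]
```

Each row describes one pocket, so the output has one row per pocket and one column per residue type.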

Note

This class requires mdtraj to be installed.

featurize(protein_file: str, pockets: List[deepchem.utils.coordinate_box_utils.CoordinateBox]) → numpy.ndarray[source]

Featurize the binding pockets of a protein.

Parameters
  • protein_file (str) – Location of PDB file. Will be loaded by MDTraj

  • pockets (List[CoordinateBox]) – List of dc.utils.CoordinateBox objects.

Returns

A numpy array of shape (len(pockets), n_residues)

Return type

np.ndarray

UserDefinedFeaturizer

class UserDefinedFeaturizer(feature_fields)[source]

Directs usage of user-computed featurizations.

__init__(feature_fields)[source]

Creates user-defined-featurizer.

featurize(datapoints: Iterable[Any], log_every_n: int = 1000) → numpy.ndarray[source]

Calculate features for datapoints.

Parameters
  • datapoints (Iterable[Any]) – A sequence of objects that you’d like to featurize. Subclasses of Featurizer should implement the _featurize method that featurizes objects in the sequence.

  • log_every_n (int, default 1000) – Logs featurization progress every log_every_n steps.

Returns

A numpy array containing a featurized representation of datapoints.

Return type

np.ndarray

DummyFeaturizer

class DummyFeaturizer[source]

Class that implements a no-op featurization. This is useful when the raw dataset has to be used without featurizing the examples. The Molnet loader requires a featurizer input and such datasets can be used in their original form by passing the raw featurizer.

Examples

>>> import deepchem as dc
>>> smi_map = [["N#C[S-].O=C(CBr)c1ccc(C(F)(F)F)cc1>CCO.[K+]", "N#CSCC(=O)c1ccc(C(F)(F)F)cc1"], ["C1COCCN1.FCC(Br)c1cccc(Br)n1>CCN(C(C)C)C(C)C.CN(C)C=O.O", "FCC(c1cccc(Br)n1)N1CCOCC1"]]
>>> Featurizer = dc.feat.DummyFeaturizer()
>>> smi_feat = Featurizer.featurize(smi_map)
>>> smi_feat
array([['N#C[S-].O=C(CBr)c1ccc(C(F)(F)F)cc1>CCO.[K+]',
        'N#CSCC(=O)c1ccc(C(F)(F)F)cc1'],
       ['C1COCCN1.FCC(Br)c1cccc(Br)n1>CCN(C(C)C)C(C)C.CN(C)C=O.O',
        'FCC(c1cccc(Br)n1)N1CCOCC1']], dtype='<U55')

featurize(datapoints: Iterable[Any], log_every_n: int = 1000) → numpy.ndarray[source]

Passes through the dataset and returns the datapoints unchanged.

Parameters

datapoints (Iterable[Any]) – A sequence of objects that you’d like to featurize.

Returns

datapoints – A numpy array containing a featurized representation of the datapoints.

Return type

np.ndarray

Base Featurizers (for developers)

Featurizer

The dc.feat.Featurizer class is the abstract parent class for all featurizers.

class Featurizer[source]

Abstract class for calculating a set of features for a datapoint.

This class is abstract and cannot be invoked directly. You’ll likely only interact with this class if you’re a developer. In that case, if you’d like to make a featurizer for a new datatype, you might want to make a child class which implements the _featurize method for calculating features for a single datapoint.
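The division of labor between featurize and _featurize can be sketched with a stand-in class. ToyFeaturizer below is hypothetical and only mimics the pattern described here; it is not dc.feat.Featurizer, it returns plain lists instead of numpy arrays, and its logging is reduced to a print call.

```python
class ToyFeaturizer:
    """Stand-in for the pattern described above; not dc.feat.Featurizer."""

    def featurize(self, datapoints, log_every_n=1000):
        """Loop over datapoints and delegate each one to _featurize."""
        features = []
        for i, point in enumerate(datapoints):
            if i % log_every_n == 0:
                print(f"Featurizing datapoint {i}")
            features.append(self._featurize(point))
        return features

    def _featurize(self, datapoint):
        # Child classes supply the per-datapoint logic.
        raise NotImplementedError("child classes implement _featurize")


class LengthFeaturizer(ToyFeaturizer):
    """Toy child class: the only feature is the length of the input."""

    def _featurize(self, datapoint):
        return [len(datapoint)]


length_features = LengthFeaturizer().featurize(["CCO", "c1ccccc1"])
print(length_features)  # [[3], [8]]
```

A child class only has to describe how one datapoint is turned into features; iteration and progress logging are inherited from the parent.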

featurize(datapoints: Iterable[Any], log_every_n: int = 1000) → numpy.ndarray[source]

Calculate features for datapoints.

Parameters
  • datapoints (Iterable[Any]) – A sequence of objects that you’d like to featurize. Subclasses of Featurizer should implement the _featurize method that featurizes objects in the sequence.

  • log_every_n (int, default 1000) – Logs featurization progress every log_every_n steps.

Returns

A numpy array containing a featurized representation of datapoints.

Return type

np.ndarray

MolecularFeaturizer

If you’re creating a new featurizer that featurizes molecules, you will want to inherit from the abstract MolecularFeaturizer base class. This featurizer can take RDKit mol objects or SMILES as inputs.

class MolecularFeaturizer[source]

Abstract class for calculating a set of features for a molecule.

The defining feature of a MolecularFeaturizer is that it uses SMILES strings and RDKit molecule objects to represent small molecules. All other featurizers which are subclasses of this class should plan to process input which comes as smiles strings or RDKit molecules.

Child classes need to implement the _featurize method for calculating features for a single molecule.

Note

The subclasses of this class require RDKit to be installed.

featurize(molecules, log_every_n=1000) → numpy.ndarray[source]

Calculate features for molecules.

Parameters
  • molecules (rdkit.Chem.rdchem.Mol / SMILES string / iterable) – RDKit Mol, or SMILES string or iterable sequence of RDKit mols/SMILES strings.

  • log_every_n (int, default 1000) – Logging messages reported every log_every_n samples.

Returns

features – A numpy array containing a featurized representation of datapoints.

Return type

np.ndarray

MaterialCompositionFeaturizer

If you’re creating a new featurizer that featurizes compositional formulas, you will want to inherit from the abstract MaterialCompositionFeaturizer base class.

class MaterialCompositionFeaturizer[source]

Abstract class for calculating a set of features for an inorganic crystal composition.

The defining feature of a MaterialCompositionFeaturizer is that it operates on 3D crystal chemical compositions. Inorganic crystal compositions are represented by Pymatgen composition objects. Featurizers for inorganic crystal compositions that are subclasses of this class should plan to process input which comes as Pymatgen composition objects.

This class is abstract and cannot be invoked directly. You’ll likely only interact with this class if you’re a developer. Child classes need to implement the _featurize method for calculating features for a single crystal composition.

Note

Some subclasses of this class will require pymatgen and matminer to be installed.

featurize(compositions: Iterable[str], log_every_n: int = 1000) → numpy.ndarray[source]

Calculate features for crystal compositions.

Parameters
  • compositions (Iterable[str]) – Iterable sequence of composition strings, e.g. “MoS2”.

  • log_every_n (int, default 1000) – Logging messages reported every log_every_n samples.

Returns

features – A numpy array containing a featurized representation of compositions.

Return type

np.ndarray

MaterialStructureFeaturizer

If you’re creating a new featurizer that featurizes inorganic crystal structures, you will want to inherit from the abstract MaterialStructureFeaturizer base class. This featurizer can take pymatgen structure objects or dictionaries as inputs.

class MaterialStructureFeaturizer[source]

Abstract class for calculating a set of features for an inorganic crystal structure.

The defining feature of a MaterialStructureFeaturizer is that it operates on 3D crystal structures with periodic boundary conditions. Inorganic crystal structures are represented by Pymatgen structure objects. Featurizers for inorganic crystal structures that are subclasses of this class should plan to process input which comes as pymatgen structure objects.

This class is abstract and cannot be invoked directly. You’ll likely only interact with this class if you’re a developer. Child classes need to implement the _featurize method for calculating features for a single crystal structure.

Note

Some subclasses of this class will require pymatgen and matminer to be installed.

featurize(structures: Iterable[Union[Dict[str, Any], Any]], log_every_n: int = 1000) → numpy.ndarray[source]

Calculate features for crystal structures.

Parameters
  • structures (Iterable[Union[Dict, pymatgen.core.Structure]]) – Iterable sequence of pymatgen structure dictionaries or pymatgen.core.Structure. Please confirm the dictionary representations of pymatgen.core.Structure from https://pymatgen.org/pymatgen.core.structure.html.

  • log_every_n (int, default 1000) – Logging messages reported every log_every_n samples.

Returns

features – A numpy array containing a featurized representation of structures.

Return type

np.ndarray

ComplexFeaturizer

If you’re creating a new featurizer that featurizes a pair of ligand molecules and proteins, you will want to inherit from the abstract ComplexFeaturizer base class. This featurizer can take a pair of PDB or SDF files which contain ligand molecules and proteins.

class ComplexFeaturizer[source]

Abstract class for calculating features for mol/protein complexes.

featurize(complexes: Iterable[Tuple[str, str]], log_every_n: int = 100) → numpy.ndarray[source]

Calculate features for mol/protein complexes.

Parameters

complexes (Iterable[Tuple[str, str]]) – List of filenames (PDB, SDF, etc.) for ligand molecules and proteins. Each element should be a tuple of the form (ligand_filename, protein_filename).

Returns

features – Array of features

Return type

np.ndarray