DeepChem contains an extensive collection of featurizers. If you
haven’t run into this terminology before, a “featurizer” is a chunk of
code which transforms raw input data into a processed form suitable
for machine learning. Machine learning methods often need data to be
pre-chewed for them to process. Think of this like a mama penguin
chewing up food so the baby penguin can digest it easily.
Now if you’ve watched a few introductory deep learning lectures, you
might ask, why do we need something like a featurizer? Isn’t part of
the promise of deep learning that we can learn patterns directly from
raw data?
Unfortunately it turns out that deep learning techniques need
featurizers just like normal machine learning methods do. Arguably,
they are less dependent on sophisticated featurizers and more capable
of learning sophisticated patterns from simpler data. But
nevertheless, deep learning systems can’t simply chew up raw files.
For this reason, DeepChem provides an extensive collection of
featurization methods which we will review on this page.
In a future version of DeepChem, we plan to simplify our graph convolution models
around a joint data representation (GraphData); until then, we provide several featurizers.
ConvMolFeaturizer and WeaveFeaturizer are used
with graph convolution models which inherit from KerasModel.
ConvMolFeaturizer is used with all of these graph convolution models
except WeaveModel; WeaveFeaturizer is used only with WeaveModel.
On the other hand, MolGraphConvFeaturizer is used
with graph convolution models which inherit from TorchModel.
MolGanFeaturizer is used with the MolGAN model,
a GAN model for the generation of small molecules.
This class implements the featurization for Duvenaud graph convolutions.
Duvenaud graph convolutions [1]_ construct a vector of descriptors for each
atom in a molecule. The featurizer computes that vector of local descriptors.
Examples
>>> import deepchem as dc
>>> smiles = ["C", "CCC"]
>>> featurizer = dc.feat.ConvMolFeaturizer(per_atom_fragmentation=False)
>>> f = featurizer.featurize(smiles)
>>> # Using ConvMolFeaturizer to create featurized fragments derived from molecules of interest.
... # This is used only in the context of performing interpretation of models using atomic
... # contributions (atom-based model interpretation)
... smiles = ["C", "CCC"]
>>> featurizer = dc.feat.ConvMolFeaturizer(per_atom_fragmentation=True)
>>> f = featurizer.featurize(smiles)
>>> len(f)  # contains 2 lists with featurized fragments from 2 mols
2
master_atom (Boolean) – if true, create a fake atom with bonds to every other atom.
Its initialization is the mean of the other atom features in
the molecule. This technique is briefly discussed in
Neural Message Passing for Quantum Chemistry
https://arxiv.org/pdf/1704.01212.pdf
use_chirality (Boolean) – if true, make the resulting atom features aware of the
chirality of the molecules in question
atom_properties (list of string or None) – properties in the RDKit Mol object to use as additional
atom-level features in the larger molecular feature. If None,
then no atom-level properties are used. Properties in the
RDKit mol object should be in the form
atom XXXXXXXX NAME
where XXXXXXXX is a zero-padded 8-digit number corresponding to the
zero-indexed atom index of each atom and NAME is the name of the property
provided in atom_properties. So “atom 00000000 sasa” would be the
name of the molecule level property in mol where the solvent
accessible surface area of atom 0 would be stored.
per_atom_fragmentation (Boolean) –
If True, then multiple “atom-depleted” versions of each molecule will be created (using featurize() method).
For each molecule, atoms are removed one at a time and the resulting molecule is featurized.
The result is a list of ConvMol objects,
one with each heavy atom removed. This is useful for subsequent model interpretation: finding atoms
favorable/unfavorable for (modelled) activity. This option is typically used in combination
with a FlatteningTransformer to split the lists into separate samples.
Since ConvMol is an object and not a numpy array, the dtype needs to be set to
object.
This class implements the featurization for Weave convolutions.
Weave convolutions were introduced in [1]_. Unlike Duvenaud graph
convolutions, weave convolutions require a quadratic matrix of interaction
descriptors for each pair of atoms. These extra descriptors may provide for
additional descriptive power but at the cost of a larger featurized dataset.
graph_distance (bool, (default True)) – If True, use graph distance for distance features. Otherwise, use
Euclidean distance. Note that if Euclidean distance is used, molecules that this
featurizer is invoked on must have valid conformer information.
explicit_H (bool, (default False)) – If true, model hydrogens in the molecule.
use_chirality (bool, (default False)) – If true, use chiral information in the featurization
max_pair_distance (Optional[int], (default None)) – This value can be a positive integer or None. This
parameter determines the maximum graph distance at which pair
features are computed. For example, if max_pair_distance==2,
then pair features are computed only for atoms at most graph
distance 2 apart. If max_pair_distance is None, all pairs are
considered (effectively an infinite max_pair_distance).
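For illustration, here is a minimal usage sketch (the SMILES inputs are arbitrary examples):

```python
import deepchem as dc

# Featurize two small molecules with the default Weave settings.
featurizer = dc.feat.WeaveFeaturizer(graph_distance=True)
feats = featurizer.featurize(["CCO", "c1ccccc1"])  # WeaveMol objects
```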
Featurizer for MolGAN de-novo molecular generation [1]_.
The default representation is in the form of a GraphMatrix object,
a wrapper for two matrices containing atom and bond type information.
The class also provides reverse (defeaturization) capabilities.
max_atom_count (int, default 9) – Maximum number of atoms used for creation of the adjacency matrix.
Molecules cannot have more atoms than this number.
Implicit hydrogens do not count.
kekulize (bool, default True) – Whether molecules should be kekulized.
Kekulization solves a number of issues with defeaturization.
bond_labels (List[RDKitBond]) – List of types of bond used for generation of adjacency matrix
atom_labels (List[int]) – List of atomic numbers used for generation of node features
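A minimal usage sketch (the SMILES input is an arbitrary example):

```python
import deepchem as dc
from rdkit import Chem

# Featurize a molecule into a GraphMatrix holding adjacency and node matrices.
featurizer = dc.feat.MolGanFeaturizer()
mol = Chem.MolFromSmiles("CCO")
graphs = featurizer.featurize([mol])
```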
This class is a featurizer of general graph convolution networks for molecules.
The default node (atom) and edge (bond) representations are based on the
WeaveNet paper. If you want to use your own representations,
you can use this class as a guide to define your own featurizer. In many cases, it’s enough
to modify the return values of construct_atom_feature or construct_bond_feature.
The default node representation is constructed by concatenating the following values,
and the feature length is 30.
Atom type: A one-hot vector of this atom, “C”, “N”, “O”, “F”, “P”, “S”, “Cl”, “Br”, “I”, “other atoms”.
Formal charge: Integer electronic charge.
Hybridization: A one-hot vector of “sp”, “sp2”, “sp3”.
Hydrogen bonding: A one-hot vector of whether this atom is a hydrogen bond donor or acceptor.
Aromatic: A one-hot vector of whether the atom belongs to an aromatic ring.
Degree: A one-hot vector of the degree (0-5) of this atom.
Number of Hydrogens: A one-hot vector of the number of hydrogens (0-4) connected to this atom.
Chirality: A one-hot vector of the chirality, “R” or “S”. (Optional)
use_edges (bool, default False) – Whether to use edge features or not.
use_chirality (bool, default False) – Whether to use chirality information or not.
If True, featurization becomes slow.
use_partial_charge (bool, default False) – Whether to use partial charge data or not.
If True, this featurizer computes Gasteiger charges.
As a result, featurization may fail for some molecules
and becomes slower.
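A minimal usage sketch (example SMILES chosen arbitrarily):

```python
import deepchem as dc

# Featurize SMILES into GraphData objects, with edge features enabled.
featurizer = dc.feat.MolGraphConvFeaturizer(use_edges=True)
graphs = featurizer.featurize(["CCO", "c1ccccc1"])
print(graphs[0].node_features.shape)  # (num_atoms, 30) with default options
```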
This class is a featurizer of PAGTN graph networks for molecules.
The featurization is based on the PAGTN model. It is
slightly more computationally intensive than the default graph convolution featurizer, but it
builds a molecular graph connecting all atom pairs, accounting for interactions of an atom with
every other atom in the molecule. According to the paper, interactions between a pair
of atoms depend on the relative distance between them, and hence the function needs
to calculate the shortest path between them.
The default node representation is constructed by concatenating the following values,
and the feature length is 94.
Atom type: One hot encoding of the atom type. It consists of the most possible elements in a chemical compound.
Formal charge: One hot encoding of formal charge of the atom.
Degree: One hot encoding of the atom degree
Explicit Valence: One hot encoding of explicit valence of an atom. The supported possibilities
include 0-6.
Implicit Valence: One hot encoding of implicit valence of an atom. The supported possibilities
include 0-5.
Aromaticity: Boolean representing if an atom is aromatic.
The default edge representation is constructed by concatenating the following values,
and the feature length is 42. It builds a complete graph where each node is connected to
every other node. The edge representations are calculated based on the shortest path between two nodes
(choose any one if multiple exist). Each bond encountered in the shortest path is used to
calculate edge features.
Bond type: A one-hot vector of the bond type, “single”, “double”, “triple”, or “aromatic”.
Conjugated: A one-hot vector of whether this bond is conjugated or not.
Same ring: A one-hot vector of whether the atoms in the pair are in the same ring.
Ring Size and Aromaticity: One hot encoding of atoms in pair based on ring size and aromaticity.
Distance: One hot encoding of the distance between pair of atoms.
max_length (int) – Maximum distance up to which shortest paths must be considered.
Paths shorter than max_length will be padded and longer ones will be
truncated; defaults to 5.
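A minimal usage sketch:

```python
import deepchem as dc

# Build complete-graph PAGTN features with shortest paths up to length 5.
featurizer = dc.feat.PagtnMolGraphFeaturizer(max_length=5)
graphs = featurizer.featurize(["CC(=O)C"])
```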
The Grover Featurizer is used to compute features suitable for the Grover model.
It accepts an rdkit molecule of type rdkit.Chem.rdchem.Mol or a SMILES string
as input and computes the following sets of features:
a molecular graph from the input molecule
functional groups which are used only during pretraining
additional features which can only be used during finetuning
Parameters:
additional_featurizer (dc.feat.Featurizer) – Given a molecular dataset, it is possible to extract additional molecular features in order
to train and finetune from the existing pretrained model. The additional_featurizer can
be used to generate additional features for the molecule.
A featurizer that featurizes an RDKit mol object as a GraphData object with 3D coordinates. The 3D coordinates are represented in the node_pos_features attribute of the GraphData object of shape [num_atoms * num_conformers, 3].
We are not explicitly handling hydrogen atoms for now. We only support ‘H’, ‘C’, ‘N’, ‘O’ and ‘F’ atoms to be present in the SMILES at this point for the MXMNet model.
features – List of length 6. The i-th value in this list provides the index of the
atom in the corresponding feature value list. The 6 feature values lists
for this function are [GraphConvConstants.possible_atom_list,
GraphConvConstants.possible_numH_list,
GraphConvConstants.possible_valence_list,
GraphConvConstants.possible_formal_charge_list,
GraphConvConstants.possible_num_radical_e_list].
a1 (RDKit atom) – The source atom to compute distances from.
num_atoms (int) – The total number of atoms.
bond_adj_list (list of lists) – bond_adj_list[i] is a list of the atom indices that atom i shares a
bond with. This list is symmetrical so if j in bond_adj_list[i] then i in
bond_adj_list[j].
max_distance (int, optional (default 7)) – The max distance to search.
Returns:
distances – Of shape (num_atoms, max_distance). Provides a one-hot encoding of the
distances. That is, distances[i] is a one-hot encoding of the distance
from a1 to atom i.
Return type:
np.ndarray
This function computes the per-atom feature vectors used by
graph convolutional featurizers.
Helper method used to compute atom pair feature vectors.
Many different featurization methods compute atom pair features
such as WeaveFeaturizer. Note that atom pair features could be
for pairs of atoms which aren’t necessarily bonded to one
another.
Parameters:
mol (RDKit Mol) – Molecule to compute features on.
bond_features_map (dict) – Dictionary that maps pairs of atom ids (say (2, 3) for a bond between
atoms 2 and 3) to the features for the bond between them.
bond_adj_list (list of lists) – bond_adj_list[i] is a list of the atom indices that atom i shares a
bond with. This list is symmetrical so if j in bond_adj_list[i] then i
in bond_adj_list[j].
bt_len (int, optional (default 6)) – The number of different bond types to consider.
graph_distance (bool, optional (default True)) – If true, use graph distance between atoms. Else use Euclidean
distance. The specified mol must have a conformer. Atomic
positions will be retrieved by calling mol.GetConformer(0).
max_pair_distance (Optional[int], (default None)) – This value can be a positive integer or None. This
parameter determines the maximum graph distance at which pair
features are computed. For example, if max_pair_distance==2,
then pair features are computed only for atoms at most graph
distance 2 apart. If max_pair_distance is None, all pairs are
considered (effectively an infinite max_pair_distance).
Note
This method requires RDKit to be installed.
Returns:
features (np.ndarray) – Of shape (N_edges, bt_len + max_distance + 1). This is the array
of pairwise features for all atom pairs, where N_edges is the
number of edges within max_pair_distance of one another in this
molecule.
pair_edges (np.ndarray) – Of shape (2, num_pairs) where num_pairs is the total number of
pairs within max_pair_distance of one another.
This class is a featurizer for the Molecule Attention Transformer [1]_.
The returned value is a numpy array which consists of molecular graph descriptions.
Processes an input RDKitMol further to be able to extract id-specific Conformers from it using mol.GetConformer().
Parameters:
mol (RDKitMol) – RDKit Mol object.
Returns:
mol – A processed RDKitMol object which is embedded, UFF-optimized, and has hydrogen atoms removed. If the former conditions are not met and there is a ValueError, then 2D coordinates are computed instead.
DeepChem already contains an atom_features function; however, we define a new one here due to the need to handle features specific to MAT.
Since we need new features like Atom GetNeighbors and IsInRing, and the number of features required for MAT is a fraction of what the DeepChem atom_features function computes, we can speed up computation by defining a custom function.
Extended Connectivity Circular Fingerprints compute a bag-of-words style
representation of a molecule by breaking it into local neighborhoods and
hashing into a bit vector of the specified size. It is used specifically
for structure-activity modelling. See [1]_ for more details.
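A minimal usage sketch (the fingerprint size and radius shown are common defaults):

```python
import deepchem as dc

# 2048-bit circular fingerprints with radius 2 (ECFP4-style).
featurizer = dc.feat.CircularFingerprint(size=2048, radius=2)
fps = featurizer.featurize(["CCO", "c1ccccc1O"])
print(fps.shape)  # (2, 2048)
```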
This class converts molecules to vector representations by using Mol2Vec.
Mol2Vec is an unsupervised machine learning approach to learn vector representations
of molecular substructures; the algorithm is based on Word2Vec, which is
one of the most popular techniques for learning word embeddings using neural networks in NLP.
Please see the details in [1]_.
Mol2Vec requires a pretrained model, so we use the model provided in the mol2vec
GitHub repository [2]_. The default model was trained on 20 million compounds downloaded
from ZINC using the following parameters.
radius 1
UNK to replace all identifiers that appear less than 4 times
pretrain_file (str, optional) – The path for pretrained model. If this value is None, we use the model which is put on
github repository (https://github.com/samoturk/mol2vec/tree/master/examples/models).
The model is trained on 20 million compounds downloaded from ZINC.
radius (int, optional (default 1)) – The fingerprint radius. The default value was used to train the model which is put on
github repository.
unseen (str, optional (default 'UNK')) – The string used to replace uncommon words/identifiers while training.
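A minimal usage sketch (the default pretrained model is downloaded on first use):

```python
import deepchem as dc

# Uses the default pretrained mol2vec model from the mol2vec repository.
featurizer = dc.feat.Mol2VecFingerprint()
vecs = featurizer.featurize(["CCC"])
```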
This class computes a list of chemical descriptors like
molecular weight, number of valence electrons, maximum and
minimum partial charge, etc using RDKit.
This class can also compute normalized descriptors, if required.
(The implementation for normalization is based on RDKit2DNormalized() method
in ‘descriptastorus’ library.)
When the is_normalized option is set as True, descriptor values are normalized across the sample
by fitting a cumulative density function. CDFs were used as opposed to simpler scaling algorithms
mainly because CDFs have the useful property that ‘each value has the same meaning: the percentage
of the population observed below the raw feature value.’
Warning: Currently, the normalizing cdf parameters are not available for BCUT2D descriptors.
(BCUT2D_MWHI, BCUT2D_MWLOW, BCUT2D_CHGHI, BCUT2D_CHGLO, BCUT2D_LOGPHI, BCUT2D_LOGPLOW, BCUT2D_MRHI, BCUT2D_MRLOW)
descriptors (List[str] (default None)) – List of RDKit descriptors to compute properties. When None, computes values
for descriptors which are chosen based on options set in the other arguments.
use_fragment (bool, optional (default True)) – If True, the return value includes the fragment binary descriptors like ‘fr_XXX’.
ipc_avg (bool, optional (default True)) – If True, the IPC descriptor is calculated with the avg=True option.
Please see this issue: https://github.com/rdkit/rdkit/issues/1527.
is_normalized (bool, optional (default False)) – If True, the return value contains normalized features.
use_bcut2d (bool, optional (default True)) – If True, the return value includes the descriptors like ‘BCUT2D_XXX’.
labels_only (bool, optional (default False)) – Returns only the presence or absence of a group.
Notes
If both labels_only and is_normalized are True, then is_normalized takes precedence.
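A minimal usage sketch:

```python
import deepchem as dc

# Compute the default set of RDKit descriptors for aspirin.
featurizer = dc.feat.RDKitDescriptors(is_normalized=False)
features = featurizer.featurize(["CC(=O)OC1=CC=CC=C1C(=O)O"])
```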
Coulomb matrices provide a representation of the electronic structure of
a molecule. For a molecule with N atoms, the Coulomb matrix is a
N X N matrix where each element gives the strength of the
electrostatic interaction between two atoms. The method is described
in more detail in [1]_.
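A minimal usage sketch; since Coulomb matrices are built from 3D geometry, a conformer is generated first using dc.utils.ConformerGenerator, following the DeepChem examples:

```python
import deepchem as dc
from rdkit import Chem

# Coulomb matrices need 3D conformers, so generate one first.
generator = dc.utils.ConformerGenerator(max_conformers=1)
mol = generator.generate_conformers(Chem.MolFromSmiles("CCO"))
featurizer = dc.feat.CoulombMatrix(max_atoms=20)
features = featurizer.featurize([mol])
```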
SmilesToSeq Featurizer takes a SMILES string, and turns it into a sequence.
Details taken from [1]_.
SMILES strings smaller than a specified max length (max_len) are padded using
the PAD token while those larger than the max length are not considered. Based
on the paper, there is also the option to add extra padding (pad_len) on both
sides of the string after length normalization. Using a character to index (char_to_idx)
mapping, the SMILES characters are turned into indices and the
resulting sequence of indices serves as the input for an embedding layer.
SmilesToImage Featurizer takes a SMILES string, and turns it into an image.
Details taken from [1]_.
The default size for the image is 80 x 80. Two image modes are currently
supported - std & engd. std is the gray scale specification,
with atomic numbers as pixel values for atom positions and a constant value of
2 for bond positions. engd is a 4-channel specification, which uses atom
properties like hybridization, valency, charges in addition to atomic number.
Bond type is also used for the bonds.
The coordinates of all atoms are computed, and lines are drawn between atoms
to indicate bonds. For the respective channels, the atom and bond positions are
set to the property values as mentioned in the paper.
Encodes any arbitrary string or molecule as a one-hot array.
This featurizer encodes the characters within any given string as a one-hot
array. It also works with RDKit molecules: it can convert RDKit molecules to
SMILES strings and then one-hot encode the characters in said strings.
charset (List[str] (default ZINC_CHARSET)) – A list of strings, where each string is length 1 and unique.
max_length (Optional[int], optional (default 100)) – The max length for the string. If the string is shorter than
max_length, it is padded with spaces.
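A minimal usage sketch using the default ZINC charset:

```python
import deepchem as dc

# One-hot encode the characters of a SMILES string.
featurizer = dc.feat.OneHotFeaturizer()
encodings = featurizer.featurize(["CCO"])
```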
This featurizer uses the sklearn OneHotEncoder to create a
sparse matrix representation of a one-hot array of any string.
It is intended for large datasets that would cause memory overload
with a standard featurizer such as OneHotFeaturizer. For example: SwissprotDataset
Encodes a molecule as a SMILES string or RDKit mol.
This featurizer can be useful when you’re trying to transform a large
collection of RDKit mol objects as SMILES strings, or alternatively as a
“no-op” featurizer in your molecular pipeline.
ecfp_power (int, optional (default 3)) – Number of bits to store ECFP features (resulting vector will be
2^ecfp_power long)
splif_power (int, optional (default 3)) – Number of bits to store SPLIF features (resulting vector will be
2^splif_power long)
box_width (float, optional (default 16.0)) – Size of a box in which voxel features are calculated. Box is centered on a
ligand centroid.
voxel_width (float, optional (default 1.0)) – Size of a 3D voxel in a grid.
flatten (bool, optional (default False)) – Indicate whether calculated features should be flattened. Output is always
flattened if flat features are specified in feature_types.
verbose (bool, optional (default True)) – Verbosity for logging
sanitize (bool, optional (default False)) – If set to True, molecules will be sanitized. Note that calculating some
features (e.g. aromatic interactions) requires sanitized molecules.
**kwargs (dict, optional) – Keyword arguments can be used to specify custom cutoffs and bins (see
default values below).
This class computes the featurization needed for AtomicConvModel.
Given two molecular structures, it computes a number of useful
geometric features. In particular, for each molecule and the global
complex, it computes a coordinates matrix of size (N_atoms, 3)
where N_atoms is the number of atoms. It also computes a
neighbor-list, a dictionary with N_atoms elements where
neighbor-list[i] is a list of the atoms the i-th atom has as
neighbors. In addition, it computes a z-matrix for the molecule,
an array of shape (N_atoms,) that contains the atomic
number of each atom.
Since the featurization computes these three quantities for each of
the two molecules and the complex, a total of 9 quantities are
returned for each complex. Note that for efficiency, fragments of
the molecules can be provided rather than the full molecules
themselves.
frag1_num_atoms (int) – Maximum number of atoms in fragment 1.
frag2_num_atoms (int) – Maximum number of atoms in fragment 2.
complex_num_atoms (int) – Maximum number of atoms in complex of frag1/frag2 together.
max_num_neighbors (int) – Maximum number of atoms considered as neighbors.
neighbor_cutoff (float) – Maximum distance (angstroms) for two atoms to be considered as
neighbors. If more than max_num_neighbors atoms fall within
this cutoff, the closest max_num_neighbors will be used.
strip_hydrogens (bool (default True)) – Remove hydrogens before computing featurization.
Material Composition Featurizers are those that work with datasets of crystal
compositions with periodic boundary conditions.
For inorganic crystal structures, these featurizers operate on chemical
compositions (e.g. “MoS2”). They should be applied on systems that have
periodic boundary conditions. Composition featurizers are not designed
to work with molecules.
Fingerprint of elemental properties from composition.
Based on the data source chosen, returns properties and statistics
(min, max, range, mean, standard deviation, mode) for a compound
based on elemental stoichiometry. E.g., the average electronegativity
of atoms in a crystal structure. The chemical fingerprint is a
vector of these statistics. For a full list of properties and statistics,
see matminer.featurizers.composition.ElementProperty(data_source).feature_labels().
This featurizer requires the optional dependencies pymatgen and
matminer. It may be useful when only crystal compositions are available
(and not 3D coordinates).
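A minimal usage sketch (requires pymatgen and matminer to be installed):

```python
import deepchem as dc

# Elemental-property statistics for a composition string.
featurizer = dc.feat.ElementPropertyFingerprint()
features = featurizer.featurize(["Fe2O3"])
```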
Fixed size vector of length 86 containing raw fractional elemental
compositions in the compound. The 86 chosen elements are based on the
original implementation at https://github.com/NU-CUCIS/ElemNet.
Returns a vector containing fractional compositions of each element
in the compound.
Material Structure Featurizers are those that work with datasets of crystals with
periodic boundary conditions. For inorganic crystal structures, these
featurizers operate on pymatgen.Structure objects, which include a
lattice and 3D coordinates that specify a periodic crystal structure.
They should be applied on systems that have periodic boundary conditions.
Structure featurizers are not designed to work with molecules.
A variant of Coulomb matrix for periodic crystals.
The sine Coulomb matrix is identical to the Coulomb matrix, except
that the inverse distance function is replaced by the inverse of
sin**2 of the vector between sites which are periodic in the
dimensions of the crystal lattice.
Features are flattened into a vector of matrix eigenvalues by default
for ML-readiness. To ensure that all feature vectors are equal
length, the maximum number of atoms (eigenvalues) in the input
dataset must be specified.
This featurizer requires the optional dependencies pymatgen and
matminer. It may be useful when crystal structures with 3D coordinates
are available.
datapoints (Iterable[Union[Dict, pymatgen.core.Structure]]) – Iterable sequence of pymatgen structure dictionaries
or pymatgen.core.Structure. Please confirm the dictionary representations
of pymatgen.core.Structure from https://pymatgen.org/pymatgen.core.structure.html.
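A minimal usage sketch on a two-site CsCl structure (requires pymatgen and matminer):

```python
import pymatgen.core as mg
import deepchem as dc

# A simple cubic CsCl crystal as a periodic example structure.
lattice = mg.Lattice.cubic(4.2)
structure = mg.Structure(lattice, ["Cs", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]])
featurizer = dc.feat.SineCoulombMatrix(max_atoms=2)
features = featurizer.featurize([structure])
```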
Based on the implementation in Crystal Graph Convolutional
Neural Networks (CGCNN). The method constructs a crystal graph
representation including atom features and bond features (neighbor
distances). Neighbors are determined by searching in a sphere around
atoms in the unit cell. A Gaussian filter is applied to neighbor distances.
All units are in angstrom.
This featurizer requires the optional dependency pymatgen. It may
be useful when 3D coordinates are available and when using graph
network models and crystal graph convolutional networks.
datapoints (Iterable[Union[Dict, pymatgen.core.Structure]]) – Iterable sequence of pymatgen structure dictionaries
or pymatgen.core.Structure. Please confirm the dictionary representations
of pymatgen.core.Structure from https://pymatgen.org/pymatgen.core.structure.html.
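A minimal usage sketch (requires pymatgen):

```python
import pymatgen.core as mg
import deepchem as dc

# Build a crystal graph for a simple periodic structure.
lattice = mg.Lattice.cubic(4.2)
structure = mg.Structure(lattice, ["Cs", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]])
featurizer = dc.feat.CGCNNFeaturizer()
graphs = featurizer.featurize([structure])
```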
Calculates the 2-D surface graph features in 6 different permutations.
Based on the implementation of the Lattice Graph Convolution Neural
Network (LCNN). This method produces the atom-wise features (one-hot encoding)
and adjacent neighbors in the specified order of permutations. Neighbors are
determined by first extracting a site’s local environment from the primitive cell,
then performing graph matching and distance matching to find neighbors.
First, the template of the primitive cell needs to be defined along with periodic
boundary conditions and active and spectator site details. A structure (data point,
i.e. a different configuration of adsorbate atoms) is then passed for featurization.
This particular featurization produces a regular graph (equal number of neighbors)
along with its permutations in 6 symmetric axes. This transformation can be
applied when the ordering of neighboring nodes around a site plays an important role
in the property predictions. Due to consideration of the local neighbor environment,
this current implementation is well suited to finding neighbors for calculating the
formation energy of adsorption tasks, where the local environment matters. Adsorption is important
in many applications such as catalyst and semiconductor design.
The permuted neighbors are calculated using the primitive cell, i.e. the periodic cells
in all the data points are built via lattice transformation of the primitive cell.
Primitive cell Format:
Pymatgen structure object with site_properties key value
“SiteTypes” mentioning whether it is an active site “A1” or spectator
site “S1”.
ns, the number of spectator type elements. For “S1” it is 1.
na, the number of active type elements. For “A1” it is 1.
aos, the different species of active elements “A1”.
pbc, the periodic boundary conditions.
Data point Structure Format(Configuration of Atoms):
Pymatgen structure object with site_properties with following key value.
“SiteTypes”, mentioning whether it is an active site “A1” or spectator
site “S1”.
“oss”, the different occupational sites. For spectator sites, set it to -1.
It is highly recommended that cells of data are directly redefined from
the primitive cell, specifically, the relative coordinates between sites
are consistent so that the lattice is non-deviated.
structure (PymatgenStructure) – Pymatgen Structure object of the primitive cell used for calculating
neighbors from lattice transformations. It also requires the site_properties
attribute with “SiteTypes” (active or spectator site).
aos (List[str]) – A list of all the active site species. For the Pt, N, NO configuration,
set it as [‘0’, ‘1’, ‘2’]
pbc (List[bool]) – Periodic boundary conditions
ns (int (default 1)) – The number of spectator type elements. For “S1” it is 1.
na (int (default 1)) – The number of active type elements. For “A1” it is 1.
cutoff (float (default 6.00)) – Cutoff radius for getting the local environment. Only
used down to 2 digits.
datapoints (Iterable[Union[Dict, pymatgen.core.Structure]]) – Iterable sequence of pymatgen structure dictionaries
or pymatgen.core.Structure. Please confirm the dictionary representations
of pymatgen.core.Structure from https://pymatgen.org/pymatgen.core.structure.html.
Featurizes SAM files, which store biological sequences aligned to a reference
sequence. This class extracts the Query Name, Query Sequence, Query Length,
Reference Name, Reference Start, CIGAR and Mapping Quality of each read in
a SAM file.
Examples
>>> from deepchem.data.data_loader import SAMLoader
>>> import deepchem as dc
>>> inputs = 'deepchem/data/tests/example.sam'
>>> featurizer = dc.feat.SAMFeaturizer()
>>> features = featurizer.featurize(inputs)

Information for each read is stored in a numpy.ndarray.

>>> type(features[0])
<class 'numpy.ndarray'>
This is the default featurizer used by SAMLoader, and it extracts the following
fields from each read in each SAM file in the given order:
- Column 0: Query Name
- Column 1: Query Sequence
- Column 2: Query Length
- Column 3: Reference Name
- Column 4: Reference Start
- Column 5: CIGAR
- Column 6: Mapping Quality
For the given example, to extract specific features, we do the following.
>>> features[0][0] # Query Name
r001
>>> features[0][1] # Query Sequence
TTAGATAAAGAGGATACTG
>>> features[0][2] # Query Length
19
>>> features[0][3] # Reference Name
ref
>>> features[0][4] # Reference Start
6
>>> features[0][5] # CIGAR
[(0, 8), (1, 4), (0, 4), (2, 1), (0, 3)]
>>> features[0][6] # Mapping Quality
30
Note
This class requires pysam to be installed. Pysam can be used with Linux or macOS.
To use pysam on Windows, use Windows Subsystem for Linux (WSL).
datapoints (Iterable[Any]) – A sequence of objects that you’d like to featurize. Subclasses of
Featurizer should implement the _featurize method that featurizes
objects in the sequence.
A tokenizer is in charge of preparing the inputs for a natural language processing model.
For many scientific applications, it is possible to treat inputs as “words”/”sentences” and
use NLP methods to make meaningful predictions. For example, SMILES strings or DNA sequences
have grammatical structure and can be usefully modeled with NLP techniques. DeepChem provides
some scientifically relevant tokenizers for use in different applications. These tokenizers are
based on those from the Huggingface transformers library (which DeepChem tokenizers inherit from).
The base classes PreTrainedTokenizer and PreTrainedTokenizerFast implement the common methods
for encoding string inputs into model inputs and for instantiating/saving python tokenizers
either from a local file or directory or from a pretrained tokenizer provided by the library
(downloaded from HuggingFace’s AWS S3 repository).
Tokenizing (spliting strings in sub-word token strings), converting tokens strings to ids and back, and encoding/decoding (i.e. tokenizing + convert to integers)
Adding new tokens to the vocabulary in a way that is independent of the underlying structure (BPE, SentencePiece…)
Managing special tokens like mask, beginning-of-sentence, etc. (adding them, assigning them to attributes in the tokenizer for easy access and making sure they are not split during tokenization)
BatchEncoding holds the output of the tokenizer’s encoding methods
(__call__, encode_plus and batch_encode_plus) and is derived from a Python dictionary.
When the tokenizer is a pure python tokenizer, this class behaves just like a standard python dictionary
and holds the various model inputs computed by these methods (input_ids, attention_mask…).
For more details on the base tokenizers which the DeepChem tokenizers inherit from,
please refer to the following: HuggingFace tokenizers docs
Tokenization methods on string-based corpuses in the life sciences are
becoming increasingly popular for NLP-based applications to chemistry and biology.
One such example is ChemBERTa, a transformer for molecular property prediction.
DeepChem offers a tutorial for utilizing ChemBERTa using an alternate tokenizer,
a Byte-Piece Encoder, which can be found here.
The dc.feat.SmilesTokenizer module inherits from the BertTokenizer class in transformers.
It runs a WordPiece tokenization algorithm over SMILES strings using the tokenization SMILES regex developed by Schwaller et al.
The SmilesTokenizer employs an atom-wise tokenization strategy using this regex.
Creates the SmilesTokenizer class. The tokenizer heavily inherits from the BertTokenizer
implementation found in Huggingface’s transformers library. It runs a WordPiece tokenization
algorithm over SMILES strings using the tokenization SMILES regex developed by Schwaller et al.
vocab_path (obj: str) – The directory in which to save the SMILES character per line vocabulary file.
Default vocab file is found in deepchem/feat/tests/data/vocab.txt
Returns:
vocab_file – Paths to the files saved:
a tuple with the path to a SMILES character-per-line vocabulary file.
The default vocab file is found in deepchem/feat/tests/data/vocab.txt
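A minimal usage sketch; the vocabulary path below points at the test vocabulary shipped in the DeepChem source tree and is illustrative:

```python
from deepchem.feat.smiles_tokenizer import SmilesTokenizer

# Load a SMILES-token-per-line vocabulary and encode aspirin.
tokenizer = SmilesTokenizer("deepchem/feat/tests/data/vocab.txt")
print(tokenizer.encode("CC(=O)OC1=CC=CC=C1C(=O)O"))
```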
The dc.feat.BasicSmilesTokenizer module uses a regex tokenization pattern to tokenize SMILES strings.
The regex was developed by Schwaller et al. The tokenizer is to be used on SMILES in cases
where the user wishes not to rely on the transformers API.
Run basic SMILES tokenization using a regex pattern developed by Schwaller et al.
This tokenizer is to be used when a tokenizer that does not require the transformers library by HuggingFace is required.
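A minimal usage sketch:

```python
from deepchem.feat.smiles_tokenizer import BasicSmilesTokenizer

# Split a SMILES string into atom-level tokens with the Schwaller regex.
tokenizer = BasicSmilesTokenizer()
print(tokenizer.tokenize("CC(=O)OC1=CC=CC=C1C(=O)O"))
```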
Wrapper class that wraps HuggingFace tokenizers as DeepChem featurizers
The HuggingFaceFeaturizer wrapper provides a wrapper
around Hugging Face tokenizers allowing them to be used as DeepChem
featurizers. This might be useful in scenarios where a user needs to use
a Hugging Face tokenizer when loading a dataset.
datapoints (Iterable[Any]) – A sequence of objects that you’d like to featurize. Subclasses of
Featurizer should implement the _featurize method that featurizes
objects in the sequence.
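A minimal usage sketch; the pretrained tokenizer name below is one example of a SMILES tokenizer hosted on the HuggingFace hub:

```python
from transformers import RobertaTokenizerFast
from deepchem.feat import HuggingFaceFeaturizer

# Wrap a pretrained SMILES tokenizer as a DeepChem featurizer.
hf_tokenizer = RobertaTokenizerFast.from_pretrained("seyonec/PubChem10M_SMILES_BPE_60k")
featurizer = HuggingFaceFeaturizer(tokenizer=hf_tokenizer)
feats = featurizer.featurize(["CC(=O)C"])
```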
Tokenizers use a vocabulary to tokenize datapoints. To build a vocabulary, an algorithm which generates a vocabulary from a corpus is required. A corpus is usually a collection of molecules, DNA sequences, etc. DeepChem provides the following algorithms to build a vocabulary from a corpus. A vocabulary builder is not a featurizer; it is a utility which helps the tokenizers featurize datapoints.
This module can be used to generate an atom vocabulary from SMILES strings for
the GROVER pretraining task. For each atom in a molecule, the vocabulary context is the
node-edge-count of the atom, where node is the neighboring atom, edge is the type of bond (single
bond or double bond) and count is the number of such node-edge pairs for the atom in its
neighborhood. For example, for the molecule ‘CC(=O)C’, the context of the first carbon atom is
C-SINGLE1 because its neighbor is a C atom, the type of bond is a SINGLE bond and the count of such
bonds is 1. The context of the second carbon atom is C-SINGLE2 and O-DOUBLE1 because
it is connected to two carbon atoms by single bonds and one O atom by a double bond.
The vocabulary of an atom is then computed as the atom-symbol_contexts, where the contexts
are sorted in alphabetical order when there are multiple contexts. For example, the
vocabulary of the second C is C_C-SINGLE2_O-DOUBLE1.
The algorithm enumerates the vocabulary of all atoms in the dataset and makes a vocabulary-to-index
mapping by sorting the vocabulary by frequency and then alphabetically. The max_size
parameter can be used for setting the size of the vocabulary. When this parameter is set,
the algorithm stops adding new words to the index when the vocabulary size reaches max_size.
Parameters:
max_size (int (optional)) – Maximum size of vocabulary
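A minimal sketch, assuming the build(dataset)/stoi interface found in recent DeepChem versions:

```python
import numpy as np
import deepchem as dc
from deepchem.feat.vocabulary_builders import GroverAtomVocabularyBuilder

# Build an atom vocabulary from a tiny SMILES dataset (interface assumed
# from recent DeepChem versions).
dataset = dc.data.NumpyDataset(X=np.array(["CC(=O)C", "CCC"]))
vocab = GroverAtomVocabularyBuilder()
vocab.build(dataset)
print(vocab.stoi)  # vocabulary-to-index mapping
```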
The dc.feat.PFMFeaturizer module implements a featurizer for position frequency matrices.
It takes in a list of multiple sequence alignments and returns a list of position frequency matrices.
Encodes a list of position frequency matrices for a given list of multiple sequence alignments.
The default character set is 25 amino acids. If you want to use a different character set, such as nucleotides, simply pass in
a list of character strings in the featurizer constructor.
The max_length parameter is the maximum length of the sequences to be featurized. If you want to featurize longer sequences, modify the
max_length parameter in the featurizer constructor.
The final row in the position frequency matrix is the unknown set, if there are any characters which are not included in the charset.
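A minimal usage sketch (the toy alignments are arbitrary):

```python
import deepchem as dc

# Each inner list is one multiple sequence alignment.
msa = [["ABC", "BCD"], ["AAA", "AAB"]]
featurizer = dc.feat.PFMFeaturizer()
pfms = featurizer.featurize(msa)
```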
Bert Featurizer.
The Bert Featurizer is a wrapper class for HuggingFace’s BertTokenizerFast.
This class is intended to allow users to use the BertTokenizer API while
remaining inside the DeepChem ecosystem.
Examples
>>> from deepchem.feat import BertFeaturizer
>>> from transformers import BertTokenizerFast
>>> tokenizer = BertTokenizerFast.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
>>> featurizer = BertFeaturizer(tokenizer)
>>> feats = featurizer.featurize(['D L I P [MASK] L V T'])
Notes
Examples are based on RostLab’s ProtBert documentation.
datapoints (Iterable[Any]) – A sequence of objects that you’d like to featurize. Subclasses of
Featurizer should implement the _featurize method that featurizes
objects in the sequence.
The Roberta Featurizer is a wrapper class around the Roberta Tokenizer,
which is used by Huggingface’s transformers library for tokenizing large corpora for Roberta models.
Please confirm the details in [1]_.
This class requires transformers to be installed.
RobertaFeaturizer uses dual inheritance with RobertaTokenizerFast in Huggingface for rapid tokenization,
as well as DeepChem’s MolecularFeaturizer class.
Add a dictionary of special tokens (eos, pad, cls, etc.) to the encoder and link them to class attributes. If
special tokens are NOT in the vocabulary, they are added to it (indexed starting from the last index of the
current vocabulary).
When adding new tokens to the vocabulary, you should make sure to also resize the token embedding matrix of the
model so that its embedding matrix matches the tokenizer.
In order to do that, please use the [~PreTrainedModel.resize_token_embeddings] method.
Using add_special_tokens will ensure your special tokens can be used in several ways:
Special tokens can be skipped when decoding using skip_special_tokens = True.
Special tokens are carefully handled by the tokenizer (they are never split), similar to AddedTokens.
You can easily refer to special tokens using tokenizer class attributes like tokenizer.cls_token. This
makes it easy to develop model-agnostic training and fine-tuning scripts.
When possible, special tokens are already registered for provided pretrained models (for instance
[BertTokenizer] cls_token is already registered to be ‘[CLS]’ and XLM’s one is also registered to be
‘</s>’).
Parameters:
special_tokens_dict (dictionary str to str or tokenizers.AddedToken) –
Keys should be in the list of predefined special attributes: [bos_token, eos_token, unk_token,
sep_token, pad_token, cls_token, mask_token, additional_special_tokens].
Tokens are only added if they are not already in the vocabulary (tested by checking if the tokenizer
assigns the index of the unk_token to them).
replace_additional_special_tokens (bool, optional, defaults to True) – If True, the existing list of additional special tokens will be replaced by the list provided in
special_tokens_dict. Otherwise, self._additional_special_tokens is just extended. In the former
case, the tokens will NOT be removed from the tokenizer’s full vocabulary - they are only being flagged
as non-special tokens. Remember, this only affects which tokens are skipped during decoding, not the
added_tokens_encoder and added_tokens_decoder. This means that the previous
additional_special_tokens are still added tokens, and will not be split by the model.
Returns:
Number of tokens added to the vocabulary.
Return type:
int
Examples:
```python
from transformers import GPT2Model, GPT2Tokenizer

# Let's see how to add a new classification token to GPT-2
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

special_tokens_dict = {"cls_token": "<CLS>"}

num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
print("We have added", num_added_toks, "tokens")
# Notice: resize_token_embeddings expects to receive the full size of the new vocabulary, i.e., the length of the tokenizer.
model.resize_token_embeddings(len(tokenizer))
```
Add a list of new tokens to the tokenizer class. If the new tokens are not in the vocabulary, they are added to
it with indices starting from the length of the current vocabulary and will be isolated before the tokenization
algorithm is applied. Added tokens and tokens from the vocabulary of the tokenization algorithm are therefore
not treated in the same way.
Note, when adding new tokens to the vocabulary, you should make sure to also resize the token embedding matrix
of the model so that its embedding matrix matches the tokenizer.
In order to do that, please use the [~PreTrainedModel.resize_token_embeddings] method.
Parameters:
new_tokens (str, tokenizers.AddedToken or a list of str or tokenizers.AddedToken) – Tokens are only added if they are not already in the vocabulary. tokenizers.AddedToken wraps a string
token to let you personalize its behavior: whether this token should only match against a single word,
whether this token should strip all potential whitespaces on the left side, whether this token should
strip all potential whitespaces on the right side, etc.
special_tokens (bool, optional, defaults to False) –
Can be used to specify if the token is a special token. This mostly changes the normalization behavior
(special tokens like CLS or [MASK] are usually not lower-cased, for instance).
See details for tokenizers.AddedToken in HuggingFace tokenizers library.
Returns:
Number of tokens added to the vocabulary.
Return type:
int
Examples:
```python
from transformers import BertModel, BertTokenizerFast

# Let's see how to increase the vocabulary of the BERT model and tokenizer
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

num_added_toks = tokenizer.add_tokens(["new_tok1", "my_new-tok2"])
print("We have added", num_added_toks, "tokens")
# Notice: resize_token_embeddings expects to receive the full size of the new vocabulary, i.e., the length of the tokenizer.
model.resize_token_embeddings(len(tokenizer))
```
Returns the sorted mapping from string to index. The added tokens encoder is cached for performance
optimisation in self._added_tokens_encoder for the slow tokenizers.
All the special tokens (‘<unk>’, ‘<cls>’, etc.); the order has
nothing to do with the index of each token. If you want to know the correct indices, check
self.added_tokens_encoder. We can’t create an order anymore, as the keys are AddedTokens and not strings.
Don’t convert tokens of tokenizers.AddedToken type to string so they can be used to control more finely how
special tokens are tokenized.
Converts a Conversation object or a list of dictionaries with “role” and “content” keys to a list of token
ids. This method is intended for use with chat models, and will read the tokenizer’s chat_template attribute to
determine the format and control tokens to use when converting. When chat_template is None, it will fall back
to the default_chat_template specified at the class level.
Parameters:
conversation (Union[List[Dict[str, str]], "Conversation"]) – A Conversation object or list of dicts
with “role” and “content” keys, representing the chat history so far.
chat_template (str, optional) – A Jinja template to use for this conversion. If
this is not passed, the model’s default chat template will be used instead.
add_generation_prompt (bool, optional) – Whether to end the prompt with the token(s) that indicate
the start of an assistant message. This is useful when you want to generate a response from the model.
Note that this argument will be passed to the chat template, and so it must be supported in the
template for this argument to have any effect.
tokenize (bool, defaults to True) – Whether to tokenize the output. If False, the output will be a string.
padding (bool, defaults to False) – Whether to pad sequences to the maximum length. Has no effect if tokenize is False.
truncation (bool, defaults to False) – Whether to truncate sequences at the maximum length. Has no effect if tokenize is False.
max_length (int, optional) – Maximum length (in tokens) to use for padding or truncation. Has no effect if tokenize is False. If
not specified, the tokenizer’s max_length attribute will be used as a default.
return_tensors (str or [~utils.TensorType], optional) – If set, will return tensors of a particular framework. Has no effect if tokenize is False. Acceptable
values are:
- ‘tf’: Return TensorFlow tf.Tensor objects.
- ‘pt’: Return PyTorch torch.Tensor objects.
- ‘np’: Return NumPy np.ndarray objects.
- ‘jax’: Return JAX jnp.ndarray objects.
**tokenizer_kwargs – Additional kwargs to pass to the tokenizer.
Returns:
A list of token ids representing the tokenized chat so far, including control tokens. This
output is ready to pass to the model, either directly or via methods like generate().
Temporarily sets the tokenizer for encoding the targets. Useful for tokenizer associated to
sequence-to-sequence models that need a slightly different processing for the labels.
Convert a list of lists of token ids into a list of strings by calling decode.
Parameters:
sequences (Union[List[int], List[List[int]], np.ndarray, torch.Tensor, tf.Tensor]) – List of tokenized input ids. Can be obtained using the __call__ method.
skip_special_tokens (bool, optional, defaults to False) – Whether or not to remove special tokens in the decoding.
clean_up_tokenization_spaces (bool, optional) – Whether or not to clean up the tokenization spaces. If None, will default to
self.clean_up_tokenization_spaces.
kwargs (additional keyword arguments, optional) – Will be passed to the underlying model specific decode method.
Tokenize and prepare for the model a list of sequences or a list of pairs of sequences.
Note: this method is deprecated; __call__ should be used instead.
Parameters:
batch_text_or_text_pairs (List[str], List[Tuple[str, str]], List[List[str]], List[Tuple[List[str], List[str]]], and for not-fast tokenizers, also List[List[int]], List[Tuple[List[int], List[int]]]) – Batch of sequences or pair of sequences to be encoded. This can be a list of
string/string-sequences/int-sequences or a list of pair of string/string-sequences/int-sequence (see
details in encode_plus).
add_special_tokens (bool, optional, defaults to True) – Whether or not to add special tokens when encoding the sequences. This will use the underlying
PretrainedTokenizerBase.build_inputs_with_special_tokens function, which defines which tokens are
automatically added to the input ids. This is useful if you want to add bos or eos tokens
automatically.
padding (bool, str or [~utils.PaddingStrategy], optional, defaults to False) –
Activates and controls padding. Accepts the following values:
True or ‘longest’: Pad to the longest sequence in the batch (or no padding if only a single
sequence is provided).
’max_length’: Pad to a maximum length specified with the argument max_length or to the maximum
acceptable input length for the model if that argument is not provided.
False or ‘do_not_pad’ (default): No padding (i.e., can output a batch with sequences of different
lengths).
truncation (bool, str or [~tokenization_utils_base.TruncationStrategy], optional, defaults to False) –
Activates and controls truncation. Accepts the following values:
True or ‘longest_first’: Truncate to a maximum length specified with the argument max_length or
to the maximum acceptable input length for the model if that argument is not provided. This will
truncate token by token, removing a token from the longest sequence in the pair if a pair of
sequences (or a batch of pairs) is provided.
’only_first’: Truncate to a maximum length specified with the argument max_length or to the
maximum acceptable input length for the model if that argument is not provided. This will only
truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
’only_second’: Truncate to a maximum length specified with the argument max_length or to the
maximum acceptable input length for the model if that argument is not provided. This will only
truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
False or ‘do_not_truncate’ (default): No truncation (i.e., can output batch with sequence lengths
greater than the model maximum admissible input size).
max_length (int, optional) –
Controls the maximum length to use by one of the truncation/padding parameters.
If left unset or set to None, this will use the predefined model maximum length if a maximum length
is required by one of the truncation/padding parameters. If the model has no specific maximum input
length (like XLNet) truncation/padding to a maximum length will be deactivated.
stride (int, optional, defaults to 0) – If set to a number along with max_length, the overflowing tokens returned when
return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence
returned to provide some overlap between truncated and overflowing sequences. The value of this
argument defines the number of overlapping tokens.
is_split_into_words (bool, optional, defaults to False) – Whether or not the input is already pre-tokenized (e.g., split into words). If set to True, the
tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace)
which it will tokenize. This is useful for NER or token classification.
pad_to_multiple_of (int, optional) – If set will pad the sequence to a multiple of the provided value. Requires padding to be activated.
This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
>= 7.5 (Volta).
return_tensors (str or [~utils.TensorType], optional) –
If set, will return tensors instead of list of python integers. Acceptable values are:
’tf’: Return TensorFlow tf.constant objects.
’pt’: Return PyTorch torch.Tensor objects.
’np’: Return Numpy np.ndarray objects.
return_token_type_ids (bool, optional) –
Whether to return token type IDs. If left to the default, will return the token type IDs according to
the specific tokenizer’s default, defined by the return_outputs attribute.
[What are token type IDs?](../glossary#token-type-ids)
return_attention_mask (bool, optional) –
Whether to return the attention mask. If left to the default, will return the attention mask according
to the specific tokenizer’s default, defined by the return_outputs attribute.
[What are attention masks?](../glossary#attention-mask)
return_overflowing_tokens (bool, optional, defaults to False) – Whether or not to return overflowing token sequences. If a pair of sequences of input ids (or a batch
of pairs) is provided with truncation_strategy = longest_first or True, an error is raised instead
of returning overflowing tokens.
return_special_tokens_mask (bool, optional, defaults to False) – Whether or not to return special tokens mask information.
return_offsets_mapping (bool, optional, defaults to False) –
Whether or not to return (char_start, char_end) for each token.
This is only available on fast tokenizers inheriting from [PreTrainedTokenizerFast], if using
Python’s tokenizer, this method will raise NotImplementedError.
return_length (bool, optional, defaults to False) – Whether or not to return the lengths of the encoded inputs.
verbose (bool, optional, defaults to True) – Whether or not to print more information and warnings.
**kwargs – passed to the self.tokenize() method
Returns:
A [BatchEncoding] with the following fields:
input_ids – List of token ids to be fed to a model.
[What are input IDs?](../glossary#input-ids)
token_type_ids – List of token type ids to be fed to a model (when return_token_type_ids=True or
if “token_type_ids” is in self.model_input_names).
[What are token type IDs?](../glossary#token-type-ids)
attention_mask – List of indices specifying which tokens should be attended to by the model (when
return_attention_mask=True or if “attention_mask” is in self.model_input_names).
[What are attention masks?](../glossary#attention-mask)
overflowing_tokens – List of overflowing tokens sequences (when a max_length is specified and
return_overflowing_tokens=True).
num_truncated_tokens – Number of tokens truncated (when a max_length is specified and
return_overflowing_tokens=True).
special_tokens_mask – List of 0s and 1s, with 1 specifying added special tokens and 0 specifying
regular sequence tokens (when add_special_tokens=True and return_special_tokens_mask=True).
length – The length of the inputs (when return_length=True)
Whether or not the slow tokenizer can be saved. Usually for sentencepiece based slow tokenizer, this
can only be True if the original “sentencepiece.model” was not deleted.
Classification token, to extract a summary of an input sequence leveraging self-attention along the full
depth of the model. Log an error if used while not having been set.
Id of the classification token in the vocabulary, to extract a summary of an input sequence
leveraging self-attention along the full depth of the model.
Converts a sequence of tokens into a single string. The simplest way to do it is “ “.join(tokens), but we
often want to remove sub-word tokenization artifacts at the same time.
Parameters:
tokens (List[str]) – The tokens to join into a string.
Create a mask from the two sequences passed to be used in a sequence-pair classification task. RoBERTa does not
make use of token type ids, therefore a list of zeros is returned.
Parameters:
token_ids_0 (List[int]) – List of IDs.
token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.
Converts a sequence of ids into a string, using the tokenizer and vocabulary with options to remove special
tokens and clean up tokenization spaces.
Similar to doing self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids)).
Parameters:
token_ids (Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]) – List of tokenized input ids. Can be obtained using the __call__ method.
skip_special_tokens (bool, optional, defaults to False) – Whether or not to remove special tokens in the decoding.
clean_up_tokenization_spaces (bool, optional) – Whether or not to clean up the tokenization spaces. If None, will default to
self.clean_up_tokenization_spaces.
kwargs (additional keyword arguments, optional) – Will be passed to the underlying model specific decode method.
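A brief sketch of encode followed by decode (assuming bert-base-uncased):
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
ids = tokenizer.encode("Hello world")                   # special tokens added by default
print(tokenizer.decode(ids))                            # '[CLS] hello world [SEP]'
print(tokenizer.decode(ids, skip_special_tokens=True))  # 'hello world'
```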
Converts a string to a sequence of ids (integer), using the tokenizer and vocabulary.
Same as doing self.convert_tokens_to_ids(self.tokenize(text)).
Parameters:
text (str, List[str] or List[int]) – The first sequence to be encoded. This can be a string, a list of strings (tokenized string using the
tokenize method) or a list of integers (tokenized string ids using the convert_tokens_to_ids
method).
text_pair (str, List[str] or List[int], optional) – Optional second sequence to be encoded. This can be a string, a list of strings (tokenized string using
the tokenize method) or a list of integers (tokenized string ids using the convert_tokens_to_ids
method).
add_special_tokens (bool, optional, defaults to True) – Whether or not to add special tokens when encoding the sequences. This will use the underlying
PretrainedTokenizerBase.build_inputs_with_special_tokens function, which defines which tokens are
automatically added to the input ids. This is useful if you want to add bos or eos tokens
automatically.
padding (bool, str or [~utils.PaddingStrategy], optional, defaults to False) –
Activates and controls padding. Accepts the following values:
True or ‘longest’: Pad to the longest sequence in the batch (or no padding if only a single
sequence is provided).
’max_length’: Pad to a maximum length specified with the argument max_length or to the maximum
acceptable input length for the model if that argument is not provided.
False or ‘do_not_pad’ (default): No padding (i.e., can output a batch with sequences of different
lengths).
truncation (bool, str or [~tokenization_utils_base.TruncationStrategy], optional, defaults to False) –
Activates and controls truncation. Accepts the following values:
True or ‘longest_first’: Truncate to a maximum length specified with the argument max_length or
to the maximum acceptable input length for the model if that argument is not provided. This will
truncate token by token, removing a token from the longest sequence in the pair if a pair of
sequences (or a batch of pairs) is provided.
’only_first’: Truncate to a maximum length specified with the argument max_length or to the
maximum acceptable input length for the model if that argument is not provided. This will only
truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
’only_second’: Truncate to a maximum length specified with the argument max_length or to the
maximum acceptable input length for the model if that argument is not provided. This will only
truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
False or ‘do_not_truncate’ (default): No truncation (i.e., can output batch with sequence lengths
greater than the model maximum admissible input size).
max_length (int, optional) –
Controls the maximum length to use by one of the truncation/padding parameters.
If left unset or set to None, this will use the predefined model maximum length if a maximum length
is required by one of the truncation/padding parameters. If the model has no specific maximum input
length (like XLNet) truncation/padding to a maximum length will be deactivated.
stride (int, optional, defaults to 0) – If set to a number along with max_length, the overflowing tokens returned when
return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence
returned to provide some overlap between truncated and overflowing sequences. The value of this
argument defines the number of overlapping tokens.
is_split_into_words (bool, optional, defaults to False) – Whether or not the input is already pre-tokenized (e.g., split into words). If set to True, the
tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace)
which it will tokenize. This is useful for NER or token classification.
pad_to_multiple_of (int, optional) – If set will pad the sequence to a multiple of the provided value. Requires padding to be activated.
This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
>= 7.5 (Volta).
return_tensors (str or [~utils.TensorType], optional) –
If set, will return tensors instead of list of python integers. Acceptable values are:
’tf’: Return TensorFlow tf.constant objects.
’pt’: Return PyTorch torch.Tensor objects.
’np’: Return Numpy np.ndarray objects.
**kwargs – Passed along to the .tokenize() method.
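A minimal sketch of encode, which returns a plain list of ids rather than a [BatchEncoding] (the ids shown are those of bert-base-uncased):
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
ids = tokenizer.encode("Hello world")
print(ids)  # [101, 7592, 2088, 102] -- [CLS] hello world [SEP]
```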
Tokenize and prepare for the model a sequence or a pair of sequences.
Warning: this method is deprecated; __call__ should be used instead.
Parameters:
text (str, List[str] or List[int] (the latter only for not-fast tokenizers)) – The first sequence to be encoded. This can be a string, a list of strings (tokenized string using the
tokenize method) or a list of integers (tokenized string ids using the convert_tokens_to_ids
method).
text_pair (str, List[str] or List[int], optional) – Optional second sequence to be encoded. This can be a string, a list of strings (tokenized string using
the tokenize method) or a list of integers (tokenized string ids using the convert_tokens_to_ids
method).
add_special_tokens (bool, optional, defaults to True) – Whether or not to add special tokens when encoding the sequences. This will use the underlying
PretrainedTokenizerBase.build_inputs_with_special_tokens function, which defines which tokens are
automatically added to the input ids. This is useful if you want to add bos or eos tokens
automatically.
padding (bool, str or [~utils.PaddingStrategy], optional, defaults to False) –
Activates and controls padding. Accepts the following values:
True or ‘longest’: Pad to the longest sequence in the batch (or no padding if only a single
sequence is provided).
’max_length’: Pad to a maximum length specified with the argument max_length or to the maximum
acceptable input length for the model if that argument is not provided.
False or ‘do_not_pad’ (default): No padding (i.e., can output a batch with sequences of different
lengths).
truncation (bool, str or [~tokenization_utils_base.TruncationStrategy], optional, defaults to False) –
Activates and controls truncation. Accepts the following values:
True or ‘longest_first’: Truncate to a maximum length specified with the argument max_length or
to the maximum acceptable input length for the model if that argument is not provided. This will
truncate token by token, removing a token from the longest sequence in the pair if a pair of
sequences (or a batch of pairs) is provided.
’only_first’: Truncate to a maximum length specified with the argument max_length or to the
maximum acceptable input length for the model if that argument is not provided. This will only
truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
’only_second’: Truncate to a maximum length specified with the argument max_length or to the
maximum acceptable input length for the model if that argument is not provided. This will only
truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
False or ‘do_not_truncate’ (default): No truncation (i.e., can output batch with sequence lengths
greater than the model maximum admissible input size).
max_length (int, optional) –
Controls the maximum length to use by one of the truncation/padding parameters.
If left unset or set to None, this will use the predefined model maximum length if a maximum length
is required by one of the truncation/padding parameters. If the model has no specific maximum input
length (like XLNet) truncation/padding to a maximum length will be deactivated.
stride (int, optional, defaults to 0) – If set to a number along with max_length, the overflowing tokens returned when
return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence
returned to provide some overlap between truncated and overflowing sequences. The value of this
argument defines the number of overlapping tokens.
is_split_into_words (bool, optional, defaults to False) – Whether or not the input is already pre-tokenized (e.g., split into words). If set to True, the
tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace)
which it will tokenize. This is useful for NER or token classification.
pad_to_multiple_of (int, optional) – If set will pad the sequence to a multiple of the provided value. Requires padding to be activated.
This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
>= 7.5 (Volta).
return_tensors (str or [~utils.TensorType], optional) –
If set, will return tensors instead of list of python integers. Acceptable values are:
’tf’: Return TensorFlow tf.constant objects.
’pt’: Return PyTorch torch.Tensor objects.
’np’: Return Numpy np.ndarray objects.
return_token_type_ids (bool, optional) –
Whether to return token type IDs. If left to the default, will return the token type IDs according to
the specific tokenizer’s default, defined by the return_outputs attribute.
[What are token type IDs?](../glossary#token-type-ids)
return_attention_mask (bool, optional) –
Whether to return the attention mask. If left to the default, will return the attention mask according
to the specific tokenizer’s default, defined by the return_outputs attribute.
[What are attention masks?](../glossary#attention-mask)
return_overflowing_tokens (bool, optional, defaults to False) – Whether or not to return overflowing token sequences. If a pair of sequences of input ids (or a batch
of pairs) is provided with truncation_strategy = longest_first or True, an error is raised instead
of returning overflowing tokens.
return_special_tokens_mask (bool, optional, defaults to False) – Whether or not to return special tokens mask information.
return_offsets_mapping (bool, optional, defaults to False) –
Whether or not to return (char_start, char_end) for each token.
This is only available on fast tokenizers inheriting from [PreTrainedTokenizerFast]; if using
Python’s tokenizer, this method will raise NotImplementedError.
return_length (bool, optional, defaults to False) – Whether or not to return the lengths of the encoded inputs.
verbose (bool, optional, defaults to True) – Whether or not to print more information and warnings.
**kwargs – Passed along to the self.tokenize() method.
Returns:
A [BatchEncoding] with the following fields:
input_ids – List of token ids to be fed to a model.
[What are input IDs?](../glossary#input-ids)
token_type_ids – List of token type ids to be fed to a model (when return_token_type_ids=True or
if “token_type_ids” is in self.model_input_names).
[What are token type IDs?](../glossary#token-type-ids)
attention_mask – List of indices specifying which tokens should be attended to by the model (when
return_attention_mask=True or if “attention_mask” is in self.model_input_names).
[What are attention masks?](../glossary#attention-mask)
overflowing_tokens – List of overflowing tokens sequences (when a max_length is specified and
return_overflowing_tokens=True).
num_truncated_tokens – Number of tokens truncated (when a max_length is specified and
return_overflowing_tokens=True).
special_tokens_mask – List of 0s and 1s, with 1 specifying added special tokens and 0 specifying
regular sequence tokens (when add_special_tokens=True and return_special_tokens_mask=True).
length – The length of the inputs (when return_length=True)
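Since this method is deprecated, a short sketch of the equivalent __call__ usage (assuming bert-base-uncased):
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("Hello world", padding="max_length", max_length=8, truncation=True)
print(enc["input_ids"])       # padded out to length 8
print(enc["attention_mask"])  # 1s for real tokens, 0s for the padding
```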
datapoints (Iterable[Any]) – A sequence of objects that you’d like to featurize. Subclasses of
Featurizer should implement the _featurize method, which featurizes
a single object in the sequence.
Instantiate a [~tokenization_utils_base.PreTrainedTokenizerBase] (or a derived class) from a predefined
tokenizer.
Parameters:
pretrained_model_name_or_path (str or os.PathLike) –
Can be either:
A string, the model id of a predefined tokenizer hosted inside a model repo on huggingface.co.
Valid model ids can be located at the root-level, like bert-base-uncased, or namespaced under a
user or organization name, like dbmdz/bert-base-german-cased.
A path to a directory containing vocabulary files required by the tokenizer, for instance saved
using the [~tokenization_utils_base.PreTrainedTokenizerBase.save_pretrained] method, e.g.,
./my_model_directory/.
(Deprecated, not applicable to all derived classes) A path or url to a single saved vocabulary
file (if and only if the tokenizer only requires a single vocabulary file like Bert or XLNet), e.g.,
./my_model_directory/vocab.txt.
cache_dir (str or os.PathLike, optional) – Path to a directory in which the downloaded predefined tokenizer vocabulary files should be cached if the
standard cache should not be used.
force_download (bool, optional, defaults to False) – Whether or not to force (re-)downloading the vocabulary files and override the cached versions if they
exist.
resume_download (bool, optional, defaults to False) – Whether or not to delete incompletely received files. Will attempt to resume the download if such a file
exists.
proxies (Dict[str, str], optional) – A dictionary of proxy servers to use by protocol or endpoint, e.g., {'http': 'foo.bar:3128',
'http://hostname': 'foo.bar:4012'}. The proxies are used on each request.
token (str or bool, optional) – The token to use as HTTP bearer authorization for remote files. If True, will use the token generated
when running huggingface-cli login (stored in ~/.huggingface).
local_files_only (bool, optional, defaults to False) – Whether or not to only rely on local files and not to attempt to download any files.
revision (str, optional, defaults to “main”) – The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a
git-based system for storing models and other artifacts on huggingface.co, so revision can be any
identifier allowed by git.
subfolder (str, optional) – In case the relevant files are located inside a subfolder of the model repo on huggingface.co (e.g. for
facebook/rag-token-base), specify it here.
inputs (additional positional arguments, optional) – Will be passed along to the Tokenizer __init__ method.
kwargs (additional keyword arguments, optional) – Will be passed to the Tokenizer __init__ method. Can be used to set special tokens like bos_token,
eos_token, unk_token, sep_token, pad_token, cls_token, mask_token,
additional_special_tokens. See parameters in the __init__ for more details.
<Tip>
Passing token=True is required when you want to use a private model.
</Tip>
Examples:
```python
# We can't instantiate directly the base class PreTrainedTokenizerBase, so let's
# show our examples on a derived class: BertTokenizer
from transformers import BertTokenizer

# Download vocabulary from huggingface.co and cache.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Download vocabulary from huggingface.co (user-uploaded) and cache.
tokenizer = BertTokenizer.from_pretrained("dbmdz/bert-base-german-cased")

# If vocabulary files are in a directory (e.g. tokenizer was saved using save_pretrained('./test/saved_model/'))
tokenizer = BertTokenizer.from_pretrained("./test/saved_model/")

# If the tokenizer uses a single vocabulary file, you can point directly to this file
tokenizer = BertTokenizer.from_pretrained("./test/saved_model/my_vocab.txt")

# You can link tokens to special vocabulary when instantiating
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", unk_token="<unk>")
# You should be sure '<unk>' is in the vocabulary when doing that.
# Otherwise use tokenizer.add_special_tokens({'unk_token': '<unk>'}) instead.
assert tokenizer.unk_token == "<unk>"
```
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer prepare_for_model or encode_plus methods.
Parameters:
token_ids_0 (List[int]) – List of ids of the first sequence.
token_ids_1 (List[int], optional) – List of ids of the second sequence.
already_has_special_tokens (bool, optional, defaults to False) – Whether or not the token list is already formatted with special tokens for the model.
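A minimal sketch (assuming bert-base-uncased; the 1s mark the added [CLS] and [SEP] tokens):
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
ids = tokenizer.encode("Hello world")  # includes special tokens
mask = tokenizer.get_special_tokens_mask(ids, already_has_special_tokens=True)
print(mask)  # [1, 0, 0, 1]
```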
Pad a single encoded input or a batch of encoded inputs up to predefined length or to the max sequence length
in the batch.
The padding side (left/right) and padding token ids are defined at the tokenizer level (with self.padding_side,
self.pad_token_id and self.pad_token_type_id).
Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the
text followed by a call to the pad method to get a padded encoding.
Note: if the encoded_inputs passed are dictionaries of numpy arrays, PyTorch tensors or TensorFlow tensors, the
result will use the same type unless you provide a different tensor type with return_tensors. In the case of
PyTorch tensors, you will however lose the specific device of your tensors.
Parameters:
encoded_inputs ([BatchEncoding], list of [BatchEncoding], Dict[str, List[int]], Dict[str, List[List[int]]] or List[Dict[str, List[int]]]) –
Tokenized inputs. Can represent one input ([BatchEncoding] or Dict[str, List[int]]) or a batch of
tokenized inputs (list of [BatchEncoding], Dict[str, List[List[int]]] or List[Dict[str,
List[int]]]) so you can use this method during preprocessing as well as in a PyTorch Dataloader
collate function.
Instead of List[int] you can have tensors (numpy arrays, PyTorch tensors or TensorFlow tensors), see
the note above for the return type.
padding (bool, str or [~utils.PaddingStrategy], optional, defaults to True) –
Select a strategy to pad the returned sequences (according to the model’s padding side and padding
index) among:
True or ‘longest’: Pad to the longest sequence in the batch (or no padding if only a single
sequence is provided).
’max_length’: Pad to a maximum length specified with the argument max_length or to the maximum
acceptable input length for the model if that argument is not provided.
False or ‘do_not_pad’ (default): No padding (i.e., can output a batch with sequences of different
lengths).
max_length (int, optional) – Maximum length of the returned list and optionally padding length (see above).
pad_to_multiple_of (int, optional) –
If set will pad the sequence to a multiple of the provided value.
This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
>= 7.5 (Volta).
return_attention_mask (bool, optional) –
Whether to return the attention mask. If left to the default, will return the attention mask according
to the specific tokenizer’s default, defined by the return_outputs attribute.
[What are attention masks?](../glossary#attention-mask)
return_tensors (str or [~utils.TensorType], optional) –
If set, will return tensors instead of list of python integers. Acceptable values are:
’tf’: Return TensorFlow tf.constant objects.
’pt’: Return PyTorch torch.Tensor objects.
’np’: Return Numpy np.ndarray objects.
verbose (bool, optional, defaults to True) – Whether or not to print more information and warnings.
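A short sketch of padding a batch after encoding (assuming bert-base-uncased):
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["Hello world", "Hi"])  # unpadded lists of ids
padded = tokenizer.pad(batch, padding="longest", return_tensors="np")
print(padded["input_ids"].shape)    # (2, length of the longest sequence)
print(padded["attention_mask"][1])  # trailing zeros mark the padding
```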
Prepares a sequence of input ids, or a pair of sequences of input ids, so that it can be used by the model. It
adds special tokens, truncates sequences if they overflow while taking the special tokens into account, and
manages a moving window (with a user-defined stride) for overflowing tokens. Note that for pair_ids
different from None and truncation_strategy = longest_first or True, it is not possible to return
overflowing tokens; such a combination of arguments will raise an error.
Parameters:
ids (List[int]) – Tokenized input ids of the first sequence. Can be obtained from a string by chaining the tokenize and
convert_tokens_to_ids methods.
pair_ids (List[int], optional) – Tokenized input ids of the second sequence. Can be obtained from a string by chaining the tokenize
and convert_tokens_to_ids methods.
add_special_tokens (bool, optional, defaults to True) – Whether or not to add special tokens when encoding the sequences. This will use the underlying
PretrainedTokenizerBase.build_inputs_with_special_tokens function, which defines which tokens are
automatically added to the input ids. This is useful if you want to add bos or eos tokens
automatically.
padding (bool, str or [~utils.PaddingStrategy], optional, defaults to False) –
Activates and controls padding. Accepts the following values:
True or ‘longest’: Pad to the longest sequence in the batch (or no padding if only a single
sequence is provided).
’max_length’: Pad to a maximum length specified with the argument max_length or to the maximum
acceptable input length for the model if that argument is not provided.
False or ‘do_not_pad’ (default): No padding (i.e., can output a batch with sequences of different
lengths).
truncation (bool, str or [~tokenization_utils_base.TruncationStrategy], optional, defaults to False) –
Activates and controls truncation. Accepts the following values:
True or ‘longest_first’: Truncate to a maximum length specified with the argument max_length or
to the maximum acceptable input length for the model if that argument is not provided. This will
truncate token by token, removing a token from the longest sequence in the pair if a pair of
sequences (or a batch of pairs) is provided.
’only_first’: Truncate to a maximum length specified with the argument max_length or to the
maximum acceptable input length for the model if that argument is not provided. This will only
truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
’only_second’: Truncate to a maximum length specified with the argument max_length or to the
maximum acceptable input length for the model if that argument is not provided. This will only
truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
False or ‘do_not_truncate’ (default): No truncation (i.e., can output batch with sequence lengths
greater than the model maximum admissible input size).
max_length (int, optional) –
Controls the maximum length to use by one of the truncation/padding parameters.
If left unset or set to None, this will use the predefined model maximum length if a maximum length
is required by one of the truncation/padding parameters. If the model has no specific maximum input
length (like XLNet) truncation/padding to a maximum length will be deactivated.
stride (int, optional, defaults to 0) – If set to a number along with max_length, the overflowing tokens returned when
return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence
returned to provide some overlap between truncated and overflowing sequences. The value of this
argument defines the number of overlapping tokens.
is_split_into_words (bool, optional, defaults to False) – Whether or not the input is already pre-tokenized (e.g., split into words). If set to True, the
tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace)
which it will tokenize. This is useful for NER or token classification.
pad_to_multiple_of (int, optional) – If set will pad the sequence to a multiple of the provided value. Requires padding to be activated.
This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
>= 7.5 (Volta).
return_tensors (str or [~utils.TensorType], optional) –
If set, will return tensors instead of list of python integers. Acceptable values are:
’tf’: Return TensorFlow tf.constant objects.
’pt’: Return PyTorch torch.Tensor objects.
’np’: Return Numpy np.ndarray objects.
return_token_type_ids (bool, optional) –
Whether to return token type IDs. If left to the default, will return the token type IDs according to
the specific tokenizer’s default, defined by the return_outputs attribute.
[What are token type IDs?](../glossary#token-type-ids)
return_attention_mask (bool, optional) –
Whether to return the attention mask. If left to the default, will return the attention mask according
to the specific tokenizer’s default, defined by the return_outputs attribute.
[What are attention masks?](../glossary#attention-mask)
return_overflowing_tokens (bool, optional, defaults to False) – Whether or not to return overflowing token sequences. If a pair of sequences of input ids (or a batch
of pairs) is provided with truncation_strategy = longest_first or True, an error is raised instead
of returning overflowing tokens.
return_special_tokens_mask (bool, optional, defaults to False) – Whether or not to return special tokens mask information.
return_offsets_mapping (bool, optional, defaults to False) –
Whether or not to return (char_start, char_end) for each token.
This is only available on fast tokenizers inheriting from [PreTrainedTokenizerFast]; if using
Python’s tokenizer, this method will raise NotImplementedError.
return_length (bool, optional, defaults to False) – Whether or not to return the lengths of the encoded inputs.
verbose (bool, optional, defaults to True) – Whether or not to print more information and warnings.
**kwargs – Passed along to the self.tokenize() method.
Returns:
A [BatchEncoding] with the following fields:
input_ids – List of token ids to be fed to a model.
[What are input IDs?](../glossary#input-ids)
token_type_ids – List of token type ids to be fed to a model (when return_token_type_ids=True or
if “token_type_ids” is in self.model_input_names).
[What are token type IDs?](../glossary#token-type-ids)
attention_mask – List of indices specifying which tokens should be attended to by the model (when
return_attention_mask=True or if “attention_mask” is in self.model_input_names).
[What are attention masks?](../glossary#attention-mask)
overflowing_tokens – List of overflowing tokens sequences (when a max_length is specified and
return_overflowing_tokens=True).
num_truncated_tokens – Number of tokens truncated (when a max_length is specified and
return_overflowing_tokens=True).
special_tokens_mask – List of 0s and 1s, with 1 specifying added special tokens and 0 specifying
regular sequence tokens (when add_special_tokens=True and return_special_tokens_mask=True).
length – The length of the inputs (when return_length=True)
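A minimal sketch of building model inputs from pre-converted ids (assuming bert-base-uncased):
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Hello world"))
enc = tokenizer.prepare_for_model(ids, add_special_tokens=True)
print(enc["input_ids"])  # the ids wrapped with the model's special tokens
```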
Prepare model inputs for translation. For best performance, translate one sentence at a time.
Parameters:
src_texts (List[str]) – List of documents to summarize or source language texts.
tgt_texts (list, optional) – List of summaries or target language texts.
max_length (int, optional) – Controls the maximum length for encoder inputs (documents to summarize or source language texts). If
left unset or set to None, this will use the predefined model maximum length if a maximum length is
required by one of the truncation/padding parameters. If the model has no specific maximum input length
(like XLNet), truncation/padding to a maximum length will be deactivated.
max_target_length (int, optional) – Controls the maximum length of decoder inputs (target language texts or summaries). If left unset or set
to None, this will use the max_length value.
padding (bool, str or [~utils.PaddingStrategy], optional, defaults to False) –
Activates and controls padding. Accepts the following values:
True or ‘longest’: Pad to the longest sequence in the batch (or no padding if only a single
sequence is provided).
’max_length’: Pad to a maximum length specified with the argument max_length or to the maximum
acceptable input length for the model if that argument is not provided.
False or ‘do_not_pad’ (default): No padding (i.e., can output a batch with sequences of different
lengths).
return_tensors (str or [~utils.TensorType], optional) –
If set, will return tensors instead of list of python integers. Acceptable values are:
’tf’: Return TensorFlow tf.constant objects.
’pt’: Return PyTorch torch.Tensor objects.
’np’: Return Numpy np.ndarray objects.
truncation (bool, str or [~tokenization_utils_base.TruncationStrategy], optional, defaults to True) –
Activates and controls truncation. Accepts the following values:
True or ‘longest_first’: Truncate to a maximum length specified with the argument max_length or
to the maximum acceptable input length for the model if that argument is not provided. This will
truncate token by token, removing a token from the longest sequence in the pair if a pair of
sequences (or a batch of pairs) is provided.
’only_first’: Truncate to a maximum length specified with the argument max_length or to the
maximum acceptable input length for the model if that argument is not provided. This will only
truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
’only_second’: Truncate to a maximum length specified with the argument max_length or to the
maximum acceptable input length for the model if that argument is not provided. This will only
truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
False or ‘do_not_truncate’ (default): No truncation (i.e., can output batch with sequence lengths
greater than the model maximum admissible input size).
**kwargs – Additional keyword arguments passed along to self.__call__.
Returns:
A [BatchEncoding] with the following fields:
input_ids – List of token ids to be fed to the encoder.
attention_mask – List of indices specifying which tokens should be attended to by the model.
labels – List of token ids for tgt_texts.
The full set of keys [input_ids, attention_mask, labels] will only be returned if tgt_texts is passed.
Otherwise, input_ids and attention_mask will be the only keys.
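A hedged sketch with a translation tokenizer (assuming the Helsinki-NLP/opus-mt-en-de checkpoint; recent transformers versions may emit a deprecation warning for this method and prefer calling the tokenizer with a text_target argument instead):
```python
from transformers import MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
batch = tokenizer.prepare_seq2seq_batch(
    src_texts=["Hello world"],
    tgt_texts=["Hallo Welt"],
    return_tensors="pt",
)
print(batch.keys())  # input_ids, attention_mask, labels
```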
repo_id (str) – The name of the repository you want to push your tokenizer to. It should contain your organization name
when pushing to a given organization.
use_temp_dir (bool, optional) – Whether or not to use a temporary directory to store the files saved before they are pushed to the Hub.
Will default to True if there is no directory named like repo_id, False otherwise.
commit_message (str, optional) – Message to commit while pushing. Will default to “Upload tokenizer”.
private (bool, optional) – Whether or not the repository created should be private.
token (bool or str, optional) – The token to use as HTTP bearer authorization for remote files. If True, will use the token generated
when running huggingface-cli login (stored in ~/.huggingface). Will default to True if repo_url
is not specified.
max_shard_size (int or str, optional, defaults to “5GB”) – Only applicable for models. The maximum size for a checkpoint before being sharded. Each checkpoint
shard will then be smaller than this size. If expressed as a string, it needs to be digits followed
by a unit (like “5MB”). We default it to “5GB” so that users can easily load models on free-tier
Google Colab instances without any CPU OOM issues.
create_pr (bool, optional, defaults to False) – Whether or not to create a PR with the uploaded files or directly commit.
safe_serialization (bool, optional, defaults to True) – Whether or not to convert the model weights to the safetensors format for safer serialization.
revision (str, optional) – Branch to push the uploaded files to.
commit_description (str, optional) – The description of the commit that will be created.
tags (List[str], optional) – List of tags to push on the Hub.
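A minimal sketch (assumes you have already authenticated with huggingface-cli login and that the hypothetical repo name is yours to create):
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Pushes the tokenizer files to a (hypothetical) repo under your namespace
tokenizer.push_to_hub("my-username/my-tokenizer", commit_message="Upload tokenizer")
```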
Register this class with a given auto class. This should only be used for custom tokenizers as the ones in the
library are already mapped with AutoTokenizer.
Warning: this API is experimental and may have some slight breaking changes in the next releases.
Parameters:
auto_class (str or type, optional, defaults to “AutoTokenizer”) – The auto class to register this new tokenizer with.
This method makes sure the full tokenizer can then be re-loaded using the
[~tokenization_utils_base.PreTrainedTokenizer.from_pretrained] class method.
Warning: this won’t save modifications you may have applied to the tokenizer after instantiation (for
instance, modifying tokenizer.do_lower_case after creation).
Parameters:
save_directory (str or os.PathLike) – The path to a directory where the tokenizer will be saved.
legacy_format (bool, optional) –
Only applicable for a fast tokenizer. If unset (default), will save the tokenizer in the unified JSON
format as well as in legacy format if it exists, i.e. with tokenizer-specific vocabulary and a separate
added_tokens file.
If False, will only save the tokenizer in the unified JSON format. This format is incompatible with
“slow” tokenizers (not powered by the tokenizers library), so the tokenizer will not be able to be
loaded in the corresponding “slow” tokenizer.
If True, will save the tokenizer in legacy format. If the “slow” tokenizer doesn’t exist, a ValueError
is raised.
filename_prefix (str, optional) – A prefix to add to the names of the files saved by the tokenizer.
push_to_hub (bool, optional, defaults to False) – Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the
repository you want to push to with repo_id (will default to the name of save_directory in your
namespace).
kwargs (Dict[str, Any], optional) – Additional key word arguments passed along to the [~utils.PushToHubMixin.push_to_hub] method.
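A short sketch of saving and reloading the full tokenizer state:
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.save_pretrained("./my_tokenizer/")
reloaded = BertTokenizer.from_pretrained("./my_tokenizer/")
```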
Save only the vocabulary of the tokenizer (vocabulary + added tokens).
This method won’t save the configuration and special token mappings of the tokenizer. Use
[~PreTrainedTokenizerFast._save_pretrained] to save the whole state of the tokenizer.
Parameters:
save_directory (str) – The directory in which to save the vocabulary.
filename_prefix (str, optional) – An optional prefix to add to the names of the saved files.
Define the truncation and the padding strategies for fast tokenizers (provided by HuggingFace tokenizers
library) and restore the tokenizer settings afterwards.
The provided tokenizer has no padding / truncation strategy before the managed section. If your tokenizer
had a padding / truncation strategy set before, it will be reset to no padding / truncation when exiting the
managed section.
Parameters:
padding_strategy ([~utils.PaddingStrategy]) – The kind of padding that will be applied to the input.
truncation_strategy ([~tokenization_utils_base.TruncationStrategy]) – The kind of truncation that will be applied to the input.
max_length (int) – The maximum size of a sequence.
stride (int) – The stride to use when handling overflow.
pad_to_multiple_of (int, optional) – If set will pad the sequence to a multiple of the provided value. This is especially useful to enable
the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
Converts a string into a sequence of tokens, replacing unknown tokens with the unk_token.
Parameters:
text (str) – The sequence to be encoded.
pair (str, optional) – A second sequence to be encoded with the first.
add_special_tokens (bool, optional, defaults to False) – Whether or not to add the special tokens associated with the corresponding model.
kwargs (additional keyword arguments, optional) – Will be passed to the underlying model specific encode method. See details in
[~PreTrainedTokenizerBase.__call__]
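A minimal sketch (assuming bert-base-uncased; the split is vocabulary-dependent):
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Hello world"))  # ['hello', 'world']
```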
Trains a tokenizer on a new corpus with the same defaults (in terms of special tokens or tokenization pipeline)
as the current one.
Parameters:
text_iterator (generator of List[str]) – The training corpus. Should be a generator of batches of texts, for instance a list of lists of texts
if you have everything in memory.
vocab_size (int) – The size of the vocabulary you want for your tokenizer.
length (int, optional) – The total number of sequences in the iterator. This is used to provide meaningful progress tracking.
new_special_tokens (list of str or AddedToken, optional) – A list of new special tokens to add to the tokenizer you are training.
special_tokens_map (Dict[str, str], optional) – If you want to rename some of the special tokens this tokenizer uses, pass along a mapping from old
special token name to new special token name in this argument.
kwargs (Dict[str, Any], optional) – Additional keyword arguments passed along to the trainer from the 🤗 Tokenizers library.
Returns:
A new tokenizer of the same type as the original one, trained on
text_iterator.
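A hedged sketch of retraining on a toy corpus (requires a fast tokenizer; the corpus and vocab_size are placeholders):
```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
corpus = [["the cat sat on the mat", "the dog barked"]]  # batches of texts
new_tokenizer = tokenizer.train_new_from_iterator(iter(corpus), vocab_size=1000)
```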
Truncates a sequence pair in-place following the strategy.
Parameters:
ids (List[int]) – Tokenized input ids of the first sequence. Can be obtained from a string by chaining the tokenize and
convert_tokens_to_ids methods.
pair_ids (List[int], optional) – Tokenized input ids of the second sequence. Can be obtained from a string by chaining the tokenize
and convert_tokens_to_ids methods.
num_tokens_to_remove (int, optional, defaults to 0) – Number of tokens to remove using the truncation strategy.
truncation_strategy (str or [~tokenization_utils_base.TruncationStrategy], optional, defaults to False) –
The strategy to follow for truncation. Can be:
’longest_first’: Truncate to a maximum length specified with the argument max_length or to the
maximum acceptable input length for the model if that argument is not provided. This will truncate
token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a
batch of pairs) is provided.
’only_first’: Truncate to a maximum length specified with the argument max_length or to the
maximum acceptable input length for the model if that argument is not provided. This will only
truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
’only_second’: Truncate to a maximum length specified with the argument max_length or to the
maximum acceptable input length for the model if that argument is not provided. This will only
truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
’do_not_truncate’ (default): No truncation (i.e., can output batch with sequence lengths greater
than the model maximum admissible input size).
stride (int, optional, defaults to 0) – If set to a positive number, the overflowing tokens returned will contain some tokens from the main
sequence returned. The value of this argument defines the number of additional tokens.
Returns:
The truncated ids, the truncated pair_ids and the list of
overflowing tokens. Note: the longest_first strategy returns an empty list of overflowing tokens if a pair
of sequences (or a batch of pairs) is provided.
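A minimal sketch (assuming bert-base-uncased):
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
ids = tokenizer.encode("a fairly long example sentence", add_special_tokens=False)
truncated, _, overflowing = tokenizer.truncate_sequences(
    ids, num_tokens_to_remove=2, truncation_strategy="longest_first"
)
print(len(ids) - len(truncated))  # 2
```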
RxnFeaturizer is a wrapper class around HuggingFace’s RobertaTokenizerFast,
intended for featurizing chemical reaction datasets. The featurizer
computes the source and target required for a seq2seq task and applies the
RobertaTokenizer to each separately. Additionally, it can either separate or
mix the reactants and reagents before tokenizing.
datapoints (Iterable[Any]) – A sequence of objects that you’d like to featurize. Subclasses of
Featurizer should implement the _featurize method, which featurizes
a single object in the sequence.
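A hedged sketch of the intended usage (the checkpoint name and reaction SMILES are illustrative; check the DeepChem API reference for the exact constructor arguments):
```python
import deepchem as dc
from transformers import RobertaTokenizerFast

# A hypothetical reaction-SMILES tokenizer checkpoint
tokenizer = RobertaTokenizerFast.from_pretrained("seyonec/PubChem10M_SMILES_BPE_450k")
featurizer = dc.feat.RxnFeaturizer(tokenizer, sep_reagent=True)
feats = featurizer.featurize(["CCS(=O)(=O)Cl.OCCBr>CCN(CC)CC.CCOCC>CCS(=O)(=O)OCCBr"])
```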
Featurizes binding pockets with information about chemical
environments.
In many applications, it’s desirable to look at binding pockets on
macromolecules which may be good targets for potential ligands or
other molecules to interact with. A BindingPocketFeaturizer
expects to be given a macromolecule, and a list of pockets to
featurize on that macromolecule. These pockets should be of the form
produced by a dc.dock.BindingPocketFinder, that is as a list of
dc.utils.CoordinateBox objects.
The base featurization in this class is currently
very simple: it counts the number of residues of each type present
in the pocket. You will likely want to override this
implementation for more sophisticated downstream use cases. Note that
this class’s implementation will only work for proteins, not for
other macromolecules.
datapoints (Iterable[Any]) – A sequence of objects that you’d like to featurize. Subclasses of
Featurizer should implement the _featurize method, which featurizes
a single object in the sequence.
Class that implements a no-op featurization.
This is useful when the raw dataset has to be used without featurizing the
examples. The MolNet loader requires a featurizer as input, and such datasets
can be used in their original form by passing this raw featurizer.
Abstract class for calculating a set of features for a datapoint.
This class is abstract and cannot be invoked directly. You’ll
likely only interact with this class if you’re a developer. In
that case, you might want to make a child class which
implements the _featurize method for calculating features for
a single datapoint, if you’d like to make a featurizer for a
new datatype.
datapoints (Iterable[Any]) – A sequence of objects that you’d like to featurize. Subclasses of
Featurizer should implement the _featurize method, which featurizes
a single object in the sequence.
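For developers, a hypothetical minimal subclass (the LengthFeaturizer name and its length feature are purely illustrative):
```python
import numpy as np
from deepchem.feat import Featurizer

class LengthFeaturizer(Featurizer):
    """Hypothetical featurizer: represents each datapoint by its length."""

    def _featurize(self, datapoint, **kwargs):
        return np.array([len(datapoint)])

feats = LengthFeaturizer().featurize(["C", "CCC"])  # array of shape (2, 1)
```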
If you’re creating a new featurizer that featurizes molecules,
you will want to inherit from the abstract MolecularFeaturizer base class.
This featurizer can take RDKit mol objects or SMILES as inputs.
Abstract class for calculating a set of features for a
molecule.
The defining feature of a MolecularFeaturizer is that it
uses SMILES strings and RDKit molecule objects to represent
small molecules. All other featurizers which are subclasses of
this class should plan to process input which comes as SMILES
strings or RDKit molecules.
Child classes need to implement the _featurize method for
calculating features for a single molecule.
The subclasses of this class require RDKit to be installed.
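A hypothetical sketch of such a subclass (the heavy-atom count feature is illustrative only):
```python
import numpy as np
from deepchem.feat import MolecularFeaturizer

class HeavyAtomCountFeaturizer(MolecularFeaturizer):
    """Hypothetical featurizer: one feature, the heavy-atom count."""

    def _featurize(self, datapoint, **kwargs):
        # datapoint arrives as an RDKit Mol; SMILES inputs are converted upstream
        return np.array([datapoint.GetNumHeavyAtoms()])

feats = HeavyAtomCountFeaturizer().featurize(["C", "CCC"])  # [[1], [3]]
```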
If you’re creating a new featurizer that featurizes compositional formulas,
you will want to inherit from the abstract MaterialCompositionFeaturizer base class.
Abstract class for calculating a set of features for an
inorganic crystal composition.
The defining feature of a MaterialCompositionFeaturizer is that it
operates on 3D crystal chemical compositions.
Inorganic crystal compositions are represented by Pymatgen composition
objects. Featurizers for inorganic crystal compositions that are
subclasses of this class should plan to process input which comes as
Pymatgen composition objects.
This class is abstract and cannot be invoked directly. You’ll
likely only interact with this class if you’re a developer. Child
classes need to implement the _featurize method for calculating
features for a single crystal composition.
Note
Some subclasses of this class will require pymatgen and matminer to be
installed.
If you’re creating a new featurizer that featurizes inorganic crystal structures,
you will want to inherit from the abstract MaterialStructureFeaturizer base class.
This featurizer can take pymatgen structure objects or dictionaries as inputs.
Abstract class for calculating a set of features for an
inorganic crystal structure.
The defining feature of a MaterialStructureFeaturizer is that it
operates on 3D crystal structures with periodic boundary conditions.
Inorganic crystal structures are represented by Pymatgen structure
objects. Featurizers for inorganic crystal structures that are subclasses of
this class should plan to process input which comes as pymatgen
structure objects.
This class is abstract and cannot be invoked directly. You’ll
likely only interact with this class if you’re a developer. Child
classes need to implement the _featurize method for calculating
features for a single crystal structure.
Note
Some subclasses of this class will require pymatgen and matminer to be
installed.
datapoints (Iterable[Union[Dict, pymatgen.core.Structure]]) – Iterable sequence of pymatgen structure dictionaries
or pymatgen.core.Structure objects. See
https://pymatgen.org/pymatgen.core.structure.html for the dictionary representation of pymatgen.core.Structure.
If you’re creating a new featurizer that featurizes a pair of ligand molecules and proteins,
you will want to inherit from the abstract ComplexFeaturizer base class.
This featurizer can take a pair of PDB or SDF files which contain ligand molecules and proteins.
If you’re creating a vocabulary builder for generating vocabulary from a corpus or input data,
the vocabulary builder must inherit from the VocabularyBuilder base class.
alias of the deepchem.feat.vocabulary_builders.hf_vocab module