DeepChem contains an extensive collection of featurizers. If you
haven’t run into this terminology before, a “featurizer” is a chunk of
code which transforms raw input data into a processed form suitable
for machine learning. Machine learning methods often need data to be
pre-chewed for them to process. Think of this like a mama penguin
chewing up food so the baby penguin can digest it easily.
Now if you’ve watched a few introductory deep learning lectures, you
might ask, why do we need something like a featurizer? Isn’t part of
the promise of deep learning that we can learn patterns directly from
raw data?
Unfortunately it turns out that deep learning techniques need
featurizers just like normal machine learning methods do. Arguably,
they are less dependent on sophisticated featurizers and more capable
of learning sophisticated patterns from simpler data. But
nevertheless, deep learning systems can’t simply chew up raw files.
For this reason, DeepChem provides an extensive collection of
featurization methods which we will review on this page.
In a future version of DeepChem, we plan to simplify our graph convolution models
around a joint data representation (GraphData); until then, we provide several featurizers.
ConvMolFeaturizer and WeaveFeaturizer are used
with graph convolution models which inherit from KerasModel.
ConvMolFeaturizer is used with all of these graph convolution models
except WeaveModel; WeaveFeaturizer is used only with WeaveModel.
On the other hand, MolGraphConvFeaturizer is used
with graph convolution models which inherit from TorchModel.
MolGanFeaturizer is used with the MolGAN model,
a GAN model for the generation of small molecules.
This class implements the featurization for Duvenaud graph convolutions.
Duvenaud graph convolutions [1]_ construct a vector of descriptors for each
atom in a molecule. The featurizer computes that vector of local descriptors.
Examples
>>> import deepchem as dc
>>> smiles = ["C", "CCC"]
>>> featurizer = dc.feat.ConvMolFeaturizer(per_atom_fragmentation=False)
>>> f = featurizer.featurize(smiles)
>>> # Using ConvMolFeaturizer to create featurized fragments derived from molecules of interest.
... # This is used only in the context of performing interpretation of models using atomic
... # contributions (atom-based model interpretation)
... smiles = ["C", "CCC"]
>>> featurizer = dc.feat.ConvMolFeaturizer(per_atom_fragmentation=True)
>>> f = featurizer.featurize(smiles)
>>> len(f)  # contains 2 lists with featurized fragments from 2 mols
2
master_atom (Boolean) – if true, create a fake atom with bonds to every other atom.
Its initialization is the mean of the other atom features in
the molecule. This technique is briefly discussed in
Neural Message Passing for Quantum Chemistry
https://arxiv.org/pdf/1704.01212.pdf
use_chirality (Boolean) – if true, make the resulting atom features aware of the
chirality of the molecules in question
atom_properties (list of string or None) – properties in the RDKit Mol object to use as additional
atom-level features in the larger molecular feature. If None,
then no atom-level properties are used. Properties in the
RDKit mol object should be in the form
atom XXXXXXXX NAME
where XXXXXXXX is a zero-padded 8-digit number corresponding to the
zero-indexed atom index of each atom and NAME is the name of the property
provided in atom_properties. So “atom 00000000 sasa” would be the
name of the molecule level property in mol where the solvent
accessible surface area of atom 0 would be stored.
per_atom_fragmentation (Boolean) –
If True, then multiple “atom-depleted” versions of each molecule will be created (using featurize() method).
For each molecule, atoms are removed one at a time and the resulting molecule is featurized.
The result is a list of ConvMol objects,
one with each heavy atom removed. This is useful for subsequent model interpretation: finding atoms
favorable/unfavorable for (modelled) activity. This option is typically used in combination
with a FlatteningTransformer to split the lists into separate samples.
Since ConvMol is an object and not a numpy array, the dtype needs to be set to
object.
This class implements the featurization for Weave convolutions.
Weave convolutions were introduced in [1]_. Unlike Duvenaud graph
convolutions, weave convolutions require a quadratic matrix of interaction
descriptors for each pair of atoms. These extra descriptors may provide for
additional descriptive power but at the cost of a larger featurized dataset.
graph_distance (bool, (default True)) – If True, use graph distance for distance features. Otherwise, use
Euclidean distance. Note that if Euclidean distance is used, molecules that this
featurizer is invoked on must have valid conformer information.
explicit_H (bool, (default False)) – If true, model hydrogens in the molecule.
use_chirality (bool, (default False)) – If true, use chiral information in the featurization
max_pair_distance (Optional[int], (default None)) – This value can be a positive integer or None. This
parameter determines the maximum graph distance at which pair
features are computed. For example, if max_pair_distance==2,
then pair features are computed only for atoms at most graph
distance 2 apart. If max_pair_distance is None, all pairs are
considered (effectively an infinite max_pair_distance).
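For illustration, here is a minimal usage sketch (the SMILES inputs are arbitrary examples):

```python
import deepchem as dc

# Featurize two small molecules with the default Weave settings.
featurizer = dc.feat.WeaveFeaturizer(graph_distance=True)
feats = featurizer.featurize(["CCO", "c1ccccc1"])  # WeaveMol objects
```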
Featurizer for MolGAN de-novo molecular generation [1]_.
The default representation is in the form of a GraphMatrix object,
a wrapper for two matrices containing atom and bond type information.
The class also provides reverse (defeaturization) capabilities.
max_atom_count (int, default 9) – Maximum number of atoms used for creation of the adjacency matrix.
Molecules cannot have more atoms than this number.
Implicit hydrogens do not count.
kekulize (bool, default True) – Whether molecules should be kekulized.
Kekulization solves a number of issues with defeaturization.
bond_labels (List[RDKitBond]) – List of types of bond used for generation of adjacency matrix
atom_labels (List[int]) – List of atomic numbers used for generation of node features
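A minimal usage sketch (the SMILES input is an arbitrary example):

```python
import deepchem as dc
from rdkit import Chem

# Featurize a molecule into a GraphMatrix holding adjacency and node matrices.
featurizer = dc.feat.MolGanFeaturizer()
mol = Chem.MolFromSmiles("CCO")
graphs = featurizer.featurize([mol])
```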
This class is a featurizer of general graph convolution networks for molecules.
The default node (atom) and edge (bond) representations are based on the
WeaveNet paper. If you want to use your own representations,
you can use this class as a guide to define your own featurizer. In many cases, it’s enough
to modify the return values of construct_atom_feature or construct_bond_feature.
The default node representation is constructed by concatenating the following values,
and the feature length is 30.
Atom type: A one-hot vector of this atom, “C”, “N”, “O”, “F”, “P”, “S”, “Cl”, “Br”, “I”, “other atoms”.
Formal charge: Integer electronic charge.
Hybridization: A one-hot vector of “sp”, “sp2”, “sp3”.
Hydrogen bonding: A one-hot vector of whether this atom is a hydrogen bond donor or acceptor.
Aromatic: A one-hot vector of whether the atom belongs to an aromatic ring.
Degree: A one-hot vector of the degree (0-5) of this atom.
Number of Hydrogens: A one-hot vector of the number of hydrogens (0-4) connected to this atom.
Chirality: A one-hot vector of the chirality, “R” or “S”. (Optional)
use_edges (bool, default False) – Whether to use edge features or not.
use_chirality (bool, default False) – Whether to use chirality information or not.
If True, featurization becomes slow.
use_partial_charge (bool, default False) – Whether to use partial charge data or not.
If True, this featurizer computes Gasteiger charges.
As a result, featurization may fail for some molecules
and becomes slower.
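A minimal usage sketch (example SMILES chosen arbitrarily):

```python
import deepchem as dc

# Featurize SMILES into GraphData objects, with edge features enabled.
featurizer = dc.feat.MolGraphConvFeaturizer(use_edges=True)
graphs = featurizer.featurize(["CCO", "c1ccccc1"])
print(graphs[0].node_features.shape)  # (num_atoms, 30) with default options
```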
This class is a featurizer of PAGTN graph networks for molecules.
The featurization is based on the PAGTN model. It is
slightly more computationally intensive than the default graph convolution featurizer, but it
builds a molecular graph connecting all atom pairs, accounting for interactions of an atom with
every other atom in the molecule. According to the paper, interactions between a pair
of atoms depend on the relative distance between them, and hence the function needs
to calculate the shortest path between them.
The default node representation is constructed by concatenating the following values,
and the feature length is 94.
Atom type: One hot encoding of the atom type. It consists of the most possible elements in a chemical compound.
Formal charge: One hot encoding of formal charge of the atom.
Degree: One hot encoding of the atom degree
Explicit Valence: One hot encoding of explicit valence of an atom. The supported possibilities
include 0-6.
Implicit Valence: One hot encoding of implicit valence of an atom. The supported possibilities
include 0-5.
Aromaticity: Boolean representing if an atom is aromatic.
The default edge representation is constructed by concatenating the following values,
and the feature length is 42. It builds a complete graph where each node is connected to
every other node. The edge representations are calculated based on the shortest path between two nodes
(choose any one if multiple exist). Each bond encountered in the shortest path is used to
calculate edge features.
Bond type: A one-hot vector of the bond type, “single”, “double”, “triple”, or “aromatic”.
Conjugated: A one-hot vector of whether this bond is conjugated or not.
Same ring: A one-hot vector of whether the atoms in the pair are in the same ring.
Ring Size and Aromaticity: One hot encoding of atoms in pair based on ring size and aromaticity.
Distance: One hot encoding of the distance between pair of atoms.
max_length (int) – Maximum distance up to which shortest paths must be considered.
Paths shorter than max_length will be padded and longer ones will be
truncated; defaults to 5.
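A minimal usage sketch:

```python
import deepchem as dc

# Build complete-graph PAGTN features with shortest paths up to length 5.
featurizer = dc.feat.PagtnMolGraphFeaturizer(max_length=5)
graphs = featurizer.featurize(["CC(=O)C"])
```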
The Grover Featurizer is used to compute features suitable for the Grover model.
It accepts an rdkit molecule of type rdkit.Chem.rdchem.Mol or a SMILES string
as input and computes the following sets of features:
a molecular graph from the input molecule
functional groups which are used only during pretraining
additional features which can only be used during finetuning
Parameters:
additional_featurizer (dc.feat.Featurizer) – Given a molecular dataset, it is possible to extract additional molecular features in order
to train and finetune from the existing pretrained model. The additional_featurizer can
be used to generate additional features for the molecule.
A featurizer that featurizes an RDKit mol object as a GraphData object with 3D coordinates. The 3D coordinates are represented in the node_pos_features attribute of the GraphData object of shape [num_atoms * num_conformers, 3].
We are not explicitly handling hydrogen atoms for now. We only support ‘H’, ‘C’, ‘N’, ‘O’ and ‘F’ atoms to be present in the SMILES at this point for the MXMNet model.
features – List of length 6. The i-th value in this list provides the index of the
atom in the corresponding feature value list. The 6 feature values lists
for this function are [GraphConvConstants.possible_atom_list,
GraphConvConstants.possible_numH_list,
GraphConvConstants.possible_valence_list,
GraphConvConstants.possible_formal_charge_list,
GraphConvConstants.possible_num_radical_e_list].
a1 (RDKit atom) – The source atom to compute distances from.
num_atoms (int) – The total number of atoms.
bond_adj_list (list of lists) – bond_adj_list[i] is a list of the atom indices that atom i shares a
bond with. This list is symmetrical so if j in bond_adj_list[i] then i in
bond_adj_list[j].
max_distance (int, optional (default 7)) – The max distance to search.
Returns:
distances – Of shape (num_atoms, max_distance). Provides a one-hot encoding of the
distances. That is, distances[i] is a one-hot encoding of the distance
from a1 to atom i.
Return type:
np.ndarray
This function computes the per-atom feature vectors used by
graph convolutional featurizers.
Helper method used to compute atom pair feature vectors.
Many different featurization methods compute atom pair features
such as WeaveFeaturizer. Note that atom pair features could be
for pairs of atoms which aren’t necessarily bonded to one
another.
Parameters:
mol (RDKit Mol) – Molecule to compute features on.
bond_features_map (dict) – Dictionary that maps pairs of atom ids (say (2, 3) for a bond between
atoms 2 and 3) to the features for the bond between them.
bond_adj_list (list of lists) – bond_adj_list[i] is a list of the atom indices that atom i shares a
bond with. This list is symmetrical so if j in bond_adj_list[i] then i
in bond_adj_list[j].
bt_len (int, optional (default 6)) – The number of different bond types to consider.
graph_distance (bool, optional (default True)) – If true, use graph distance between atoms. Else use Euclidean
distance. The specified mol must have a conformer. Atomic
positions will be retrieved by calling mol.GetConformer(0).
max_pair_distance (Optional[int], (default None)) – This value can be a positive integer or None. This
parameter determines the maximum graph distance at which pair
features are computed. For example, if max_pair_distance==2,
then pair features are computed only for atoms at most graph
distance 2 apart. If max_pair_distance is None, all pairs are
considered (effectively an infinite max_pair_distance).
Note
This method requires RDKit to be installed.
Returns:
features (np.ndarray) – Of shape (N_edges, bt_len + max_distance + 1). This is the array
of pairwise features for all atom pairs, where N_edges is the
number of edges within max_pair_distance of one another in this
molecule.
pair_edges (np.ndarray) – Of shape (2, num_pairs) where num_pairs is the total number of
pairs within max_pair_distance of one another.
This class is a featurizer for the Molecule Attention Transformer [1]_.
The returned value is a numpy array which consists of molecular graph descriptions.
Processes an input RDKitMol further to be able to extract id-specific Conformers from it using mol.GetConformer().
Parameters:
mol (RDKitMol) – RDKit Mol object.
Returns:
mol – A processed RDKitMol object which is embedded, UFF-optimized, and has hydrogen atoms removed. If the former conditions are not met and there is a ValueError, then 2D coordinates are computed instead.
DeepChem already contains an atom_features function; however, we define a new one here due to the need to handle features specific to MAT.
Since we need new features like Atom GetNeighbors and IsInRing, and the number of features required for MAT is a fraction of what the DeepChem atom_features function computes, we can speed up computation by defining a custom function.
Extended Connectivity Circular Fingerprints compute a bag-of-words style
representation of a molecule by breaking it into local neighborhoods and
hashing into a bit vector of the specified size. It is used specifically
for structure-activity modelling. See [1]_ for more details.
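A minimal usage sketch (the fingerprint size and radius shown are common defaults):

```python
import deepchem as dc

# 2048-bit circular fingerprints with radius 2 (ECFP4-style).
featurizer = dc.feat.CircularFingerprint(size=2048, radius=2)
fps = featurizer.featurize(["CCO", "c1ccccc1O"])
print(fps.shape)  # (2, 2048)
```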
This class converts molecules to vector representations by using Mol2Vec.
Mol2Vec is an unsupervised machine learning approach to learn vector representations
of molecular substructures; the algorithm is based on Word2Vec, which is
one of the most popular techniques for learning word embeddings using neural networks in NLP.
Please see the details in [1]_.
Mol2Vec requires a pretrained model, so we use the model provided in the mol2vec
GitHub repository [2]_. The default model was trained on 20 million compounds downloaded
from ZINC using the following parameters.
radius 1
UNK to replace all identifiers that appear less than 4 times
pretrain_file (str, optional) – The path for pretrained model. If this value is None, we use the model which is put on
github repository (https://github.com/samoturk/mol2vec/tree/master/examples/models).
The model is trained on 20 million compounds downloaded from ZINC.
radius (int, optional (default 1)) – The fingerprint radius. The default value was used to train the model which is put on
github repository.
unseen (str, optional (default 'UNK')) – The string used to replace uncommon words/identifiers while training.
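A minimal usage sketch (the default pretrained model is downloaded on first use):

```python
import deepchem as dc

# Uses the default pretrained mol2vec model from the mol2vec repository.
featurizer = dc.feat.Mol2VecFingerprint()
vecs = featurizer.featurize(["CCC"])
```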
This class computes a list of chemical descriptors like
molecular weight, number of valence electrons, maximum and
minimum partial charge, etc using RDKit.
This class can also compute normalized descriptors, if required.
(The implementation for normalization is based on RDKit2DNormalized() method
in ‘descriptastorus’ library.)
When the is_normalized option is set as True, descriptor values are normalized across the sample
by fitting a cumulative density function. CDFs were used as opposed to simpler scaling algorithms
mainly because CDFs have the useful property that ‘each value has the same meaning: the percentage
of the population observed below the raw feature value.’
Warning: Currently, the normalizing cdf parameters are not available for BCUT2D descriptors.
(BCUT2D_MWHI, BCUT2D_MWLOW, BCUT2D_CHGHI, BCUT2D_CHGLO, BCUT2D_LOGPHI, BCUT2D_LOGPLOW, BCUT2D_MRHI, BCUT2D_MRLOW)
descriptors (List[str] (default None)) – List of RDKit descriptors to compute properties. When None, computes values
for descriptors which are chosen based on options set in the other arguments.
use_fragment (bool, optional (default True)) – If True, the return value includes the fragment binary descriptors like ‘fr_XXX’.
ipc_avg (bool, optional (default True)) – If True, the IPC descriptor is calculated with the avg=True option.
Please see this issue: https://github.com/rdkit/rdkit/issues/1527.
is_normalized (bool, optional (default False)) – If True, the return value contains normalized features.
use_bcut2d (bool, optional (default True)) – If True, the return value includes the descriptors like ‘BCUT2D_XXX’.
labels_only (bool, optional (default False)) – Returns only the presence or absence of a group.
Notes
If both labels_only and is_normalized are True, then is_normalized takes precedence.
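A minimal usage sketch:

```python
import deepchem as dc

# Compute the default set of RDKit descriptors for aspirin.
featurizer = dc.feat.RDKitDescriptors(is_normalized=False)
features = featurizer.featurize(["CC(=O)OC1=CC=CC=C1C(=O)O"])
```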
Coulomb matrices provide a representation of the electronic structure of
a molecule. For a molecule with N atoms, the Coulomb matrix is a
N X N matrix where each element gives the strength of the
electrostatic interaction between two atoms. The method is described
in more detail in [1]_.
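A minimal usage sketch; since Coulomb matrices are built from 3D geometry, a conformer is generated first using dc.utils.ConformerGenerator, following the DeepChem examples:

```python
import deepchem as dc
from rdkit import Chem

# Coulomb matrices need 3D conformers, so generate one first.
generator = dc.utils.ConformerGenerator(max_conformers=1)
mol = generator.generate_conformers(Chem.MolFromSmiles("CCO"))
featurizer = dc.feat.CoulombMatrix(max_atoms=20)
features = featurizer.featurize([mol])
```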
SmilesToSeq Featurizer takes a SMILES string, and turns it into a sequence.
Details taken from [1]_.
SMILES strings smaller than a specified max length (max_len) are padded using
the PAD token while those larger than the max length are not considered. Based
on the paper, there is also the option to add extra padding (pad_len) on both
sides of the string after length normalization. Using a character to index (char_to_idx)
mapping, the SMILES characters are turned into indices and the
resulting sequence of indices serves as the input for an embedding layer.
SmilesToImage Featurizer takes a SMILES string, and turns it into an image.
Details taken from [1]_.
The default size for the image is 80 x 80. Two image modes are currently
supported - std & engd. std is the gray scale specification,
with atomic numbers as pixel values for atom positions and a constant value of
2 for bond positions. engd is a 4-channel specification, which uses atom
properties like hybridization, valency, charges in addition to atomic number.
Bond type is also used for the bonds.
The coordinates of all atoms are computed, and lines are drawn between atoms
to indicate bonds. For the respective channels, the atom and bond positions are
set to the property values as mentioned in the paper.
Encodes any arbitrary string or molecule as a one-hot array.
This featurizer encodes the characters within any given string as a one-hot
array. It also works with RDKit molecules: it can convert RDKit molecules to
SMILES strings and then one-hot encode the characters in said strings.
charset (List[str] (default ZINC_CHARSET)) – A list of strings, where each string is length 1 and unique.
max_length (Optional[int], optional (default 100)) – The max length for the string. If the string is shorter than
max_length, it is padded with spaces.
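A minimal usage sketch using the default ZINC charset:

```python
import deepchem as dc

# One-hot encode the characters of a SMILES string.
featurizer = dc.feat.OneHotFeaturizer()
encodings = featurizer.featurize(["CCO"])
```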
This featurizer uses the sklearn OneHotEncoder to create a
sparse matrix representation of a one-hot array of any string.
It is intended for large datasets that would cause memory overload
with a standard featurizer such as OneHotFeaturizer. For example: SwissprotDataset
Encodes a molecule as a SMILES string or RDKit mol.
This featurizer can be useful when you’re trying to transform a large
collection of RDKit mol objects as SMILES strings, or alternatively as a
“no-op” featurizer in your molecular pipeline.
ecfp_power (int, optional (default 3)) – Number of bits to store ECFP features (resulting vector will be
2^ecfp_power long)
splif_power (int, optional (default 3)) – Number of bits to store SPLIF features (resulting vector will be
2^splif_power long)
box_width (float, optional (default 16.0)) – Size of a box in which voxel features are calculated. Box is centered on a
ligand centroid.
voxel_width (float, optional (default 1.0)) – Size of a 3D voxel in a grid.
flatten (bool, optional (default False)) – Indicate whether calculated features should be flattened. Output is always
flattened if flat features are specified in feature_types.
verbose (bool, optional (default True)) – Verbosity for logging
sanitize (bool, optional (default False)) – If set to True, molecules will be sanitized. Note that calculating some
features (e.g. aromatic interactions) requires sanitized molecules.
**kwargs (dict, optional) – Keyword arguments can be used to specify custom cutoffs and bins (see
default values below).
This class computes the featurization needed for AtomicConvModel.
Given two molecular structures, it computes a number of useful
geometric features. In particular, for each molecule and the global
complex, it computes a coordinates matrix of size (N_atoms, 3)
where N_atoms is the number of atoms. It also computes a
neighbor-list, a dictionary with N_atoms elements where
neighbor-list[i] is a list of the atoms the i-th atom has as
neighbors. In addition, it computes a z-matrix for the molecule,
an array of shape (N_atoms,) that contains the atomic
number of each atom.
Since the featurization computes these three quantities for each of
the two molecules and the complex, a total of 9 quantities are
returned for each complex. Note that for efficiency, fragments of
the molecules can be provided rather than the full molecules
themselves.
frag1_num_atoms (int) – Maximum number of atoms in fragment 1.
frag2_num_atoms (int) – Maximum number of atoms in fragment 2.
complex_num_atoms (int) – Maximum number of atoms in complex of frag1/frag2 together.
max_num_neighbors (int) – Maximum number of atoms considered as neighbors.
neighbor_cutoff (float) – Maximum distance (angstroms) for two atoms to be considered as
neighbors. If more than max_num_neighbors atoms fall within
this cutoff, the closest max_num_neighbors will be used.
strip_hydrogens (bool (default True)) – Remove hydrogens before computing featurization.
Material Composition Featurizers are those that work with datasets of crystal
compositions with periodic boundary conditions.
For inorganic crystal structures, these featurizers operate on chemical
compositions (e.g. “MoS2”). They should be applied on systems that have
periodic boundary conditions. Composition featurizers are not designed
to work with molecules.
Fingerprint of elemental properties from composition.
Based on the data source chosen, returns properties and statistics
(min, max, range, mean, standard deviation, mode) for a compound
based on elemental stoichiometry. E.g., the average electronegativity
of atoms in a crystal structure. The chemical fingerprint is a
vector of these statistics. For a full list of properties and statistics,
see matminer.featurizers.composition.ElementProperty(data_source).feature_labels().
This featurizer requires the optional dependencies pymatgen and
matminer. It may be useful when only crystal compositions are available
(and not 3D coordinates).
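A minimal usage sketch (requires pymatgen and matminer to be installed):

```python
import deepchem as dc

# Elemental-property statistics for a composition string.
featurizer = dc.feat.ElementPropertyFingerprint()
features = featurizer.featurize(["Fe2O3"])
```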
Fixed size vector of length 86 containing raw fractional elemental
compositions in the compound. The 86 chosen elements are based on the
original implementation at https://github.com/NU-CUCIS/ElemNet.
Returns a vector containing fractional compositions of each element
in the compound.
Material Structure Featurizers are those that work with datasets of crystals with
periodic boundary conditions. For inorganic crystal structures, these
featurizers operate on pymatgen.Structure objects, which include a
lattice and 3D coordinates that specify a periodic crystal structure.
They should be applied on systems that have periodic boundary conditions.
Structure featurizers are not designed to work with molecules.
A variant of Coulomb matrix for periodic crystals.
The sine Coulomb matrix is identical to the Coulomb matrix, except
that the inverse distance function is replaced by the inverse of
sin**2 of the vector between sites which are periodic in the
dimensions of the crystal lattice.
Features are flattened into a vector of matrix eigenvalues by default
for ML-readiness. To ensure that all feature vectors are equal
length, the maximum number of atoms (eigenvalues) in the input
dataset must be specified.
This featurizer requires the optional dependencies pymatgen and
matminer. It may be useful when crystal structures with 3D coordinates
are available.
datapoints (Iterable[Union[Dict, pymatgen.core.Structure]]) – Iterable sequence of pymatgen structure dictionaries
or pymatgen.core.Structure. Please confirm the dictionary representations
of pymatgen.core.Structure from https://pymatgen.org/pymatgen.core.structure.html.
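A minimal usage sketch on a two-site CsCl structure (requires pymatgen and matminer):

```python
import pymatgen.core as mg
import deepchem as dc

# A simple cubic CsCl crystal as a periodic example structure.
lattice = mg.Lattice.cubic(4.2)
structure = mg.Structure(lattice, ["Cs", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]])
featurizer = dc.feat.SineCoulombMatrix(max_atoms=2)
features = featurizer.featurize([structure])
```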
Based on the implementation in Crystal Graph Convolutional
Neural Networks (CGCNN). The method constructs a crystal graph
representation including atom features and bond features (neighbor
distances). Neighbors are determined by searching in a sphere around
atoms in the unit cell. A Gaussian filter is applied to neighbor distances.
All units are in angstrom.
This featurizer requires the optional dependency pymatgen. It may
be useful when 3D coordinates are available and when using graph
network models and crystal graph convolutional networks.
datapoints (Iterable[Union[Dict, pymatgen.core.Structure]]) – Iterable sequence of pymatgen structure dictionaries
or pymatgen.core.Structure. Please confirm the dictionary representations
of pymatgen.core.Structure from https://pymatgen.org/pymatgen.core.structure.html.
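A minimal usage sketch (requires pymatgen):

```python
import pymatgen.core as mg
import deepchem as dc

# Build a crystal graph for a simple periodic structure.
lattice = mg.Lattice.cubic(4.2)
structure = mg.Structure(lattice, ["Cs", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]])
featurizer = dc.feat.CGCNNFeaturizer()
graphs = featurizer.featurize([structure])
```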
Calculates the 2-D surface graph features in 6 different permutations.
Based on the implementation of the Lattice Graph Convolution Neural
Network (LCNN). This method produces the atom-wise features (one-hot encoding)
and adjacent neighbors in the specified order of permutations. Neighbors are
determined by first extracting a site’s local environment from the primitive cell,
then performing graph matching and distance matching to find neighbors.
First, the template of the primitive cell needs to be defined along with periodic
boundary conditions and active and spectator site details. A structure (data point,
i.e. a different configuration of adsorbate atoms) is then passed for featurization.
This particular featurization produces a regular graph (equal number of neighbors)
along with its permutations in 6 symmetric axes. This transformation can be
applied when the ordering of neighboring nodes around a site plays an important role
in the property predictions. Due to consideration of the local neighbor environment,
this current implementation is well suited to finding neighbors for calculating the
formation energy of adsorption tasks, where the local environment matters. Adsorption is important
in many applications such as catalyst and semiconductor design.
The permuted neighbors are calculated using the primitive cell, i.e. the periodic cells
in all the data points are built via lattice transformation of the primitive cell.
Primitive cell Format:
Pymatgen structure object with site_properties key value
“SiteTypes” mentioning whether it is an active site “A1” or spectator
site “S1”.
ns, the number of spectator type elements. For “S1” it is 1.
na, the number of active type elements. For “A1” it is 1.
aos, the different species of active elements “A1”.
pbc, the periodic boundary conditions.
Data point Structure Format(Configuration of Atoms):
Pymatgen structure object with site_properties with following key value.
“SiteTypes”, mentioning whether it is an active site “A1” or spectator
site “S1”.
“oss”, the different occupational sites. For spectator sites, set it to -1.
It is highly recommended that cells of data are directly redefined from
the primitive cell, specifically, the relative coordinates between sites
are consistent so that the lattice is non-deviated.
structure (PymatgenStructure) – Pymatgen Structure object of the primitive cell used for calculating
neighbors from lattice transformations. It also requires the site_properties
attribute with “SiteTypes” (active or spectator site).
aos (List[str]) – A list of all the active site species. For the Pt, N, NO configuration,
set it as [‘0’, ‘1’, ‘2’]
pbc (List[bool]) – Periodic boundary conditions
ns (int (default 1)) – The number of spectator type elements. For “S1” it is 1.
na (int (default 1)) – The number of active type elements. For “A1” it is 1.
cutoff (float (default 6.00)) – Cutoff radius for getting the local environment. Only
used down to 2 digits.
datapoints (Iterable[Union[Dict, pymatgen.core.Structure]]) – Iterable sequence of pymatgen structure dictionaries
or pymatgen.core.Structure. Please confirm the dictionary representations
of pymatgen.core.Structure from https://pymatgen.org/pymatgen.core.structure.html.
Featurizes SAM files, which store biological sequences aligned to a reference
sequence. This class extracts the Query Name, Query Sequence, Query Length,
Reference Name, Reference Start, CIGAR and Mapping Quality of each read in
a SAM file.
Examples
>>> from deepchem.data.data_loader import SAMLoader
>>> import deepchem as dc
>>> inputs = 'deepchem/data/tests/example.sam'
>>> featurizer = dc.feat.SAMFeaturizer()
>>> features = featurizer.featurize(inputs)

Information for each read is stored in a numpy.ndarray.

>>> type(features[0])
<class 'numpy.ndarray'>
This is the default featurizer used by SAMLoader, and it extracts the following
fields from each read in each SAM file in the given order:
- Column 0: Query Name
- Column 1: Query Sequence
- Column 2: Query Length
- Column 3: Reference Name
- Column 4: Reference Start
- Column 5: CIGAR
- Column 6: Mapping Quality
For the given example, to extract specific features, we do the following.
>>> features[0][0] # Query Name
r001
>>> features[0][1] # Query Sequence
TTAGATAAAGAGGATACTG
>>> features[0][2] # Query Length
19
>>> features[0][3] # Reference Name
ref
>>> features[0][4] # Reference Start
6
>>> features[0][5] # CIGAR
[(0, 8), (1, 4), (0, 4), (2, 1), (0, 3)]
>>> features[0][6] # Mapping Quality
30
Note
This class requires pysam to be installed. Pysam can be used with Linux or macOS.
To use pysam on Windows, use Windows Subsystem for Linux (WSL).
datapoints (Iterable[Any]) – A sequence of objects that you’d like to featurize. Subclasses of
Featurizer should implement the _featurize method that featurizes
objects in the sequence.
A tokenizer is in charge of preparing the inputs for a natural language processing model.
For many scientific applications, it is possible to treat inputs as “words”/”sentences” and
use NLP methods to make meaningful predictions. For example, SMILES strings or DNA sequences
have grammatical structure and can be usefully modeled with NLP techniques. DeepChem provides
some scientifically relevant tokenizers for use in different applications. These tokenizers are
based on those from the Huggingface transformers library (which DeepChem tokenizers inherit from).
The base classes PreTrainedTokenizer and PreTrainedTokenizerFast implement the common methods
for encoding string inputs into model inputs and for instantiating/saving python tokenizers
either from a local file or directory or from a pretrained tokenizer provided by the library
(downloaded from HuggingFace’s AWS S3 repository).
Tokenizing (spliting strings in sub-word token strings), converting tokens strings to ids and back, and encoding/decoding (i.e. tokenizing + convert to integers)
Adding new tokens to the vocabulary in a way that is independent of the underlying structure (BPE, SentencePiece…)
Managing special tokens like mask, beginning-of-sentence, etc. (adding them, assigning them to attributes in the tokenizer for easy access and making sure they are not split during tokenization)
BatchEncoding holds the output of the tokenizer’s encoding methods
(__call__, encode_plus and batch_encode_plus) and is derived from a Python dictionary.
When the tokenizer is a pure python tokenizer, this class behaves just like a standard python dictionary
and holds the various model inputs computed by these methods (input_ids, attention_mask…).
For more details on the base tokenizers which the DeepChem tokenizers inherit from,
please refer to the following: HuggingFace tokenizers docs
Tokenization methods on string-based corpuses in the life sciences are
becoming increasingly popular for NLP-based applications to chemistry and biology.
One such example is ChemBERTa, a transformer for molecular property prediction.
DeepChem offers a tutorial for utilizing ChemBERTa using an alternate tokenizer,
a Byte-Piece Encoder, which can be found here.
The dc.feat.SmilesTokenizer module inherits from the BertTokenizer class in transformers.
It runs a WordPiece tokenization algorithm over SMILES strings using the tokenization SMILES regex developed by Schwaller et al.
The SmilesTokenizer employs an atom-wise tokenization strategy using this regex.
Creates the SmilesTokenizer class. The tokenizer heavily inherits from the BertTokenizer
implementation found in Huggingface’s transformers library. It runs a WordPiece tokenization
algorithm over SMILES strings using the tokenization SMILES regex developed by Schwaller et al.
vocab_path (obj: str) – The directory in which to save the SMILES character per line vocabulary file.
Default vocab file is found in deepchem/feat/tests/data/vocab.txt
Returns:
vocab_file – Paths to the files saved:
a tuple with the path to a SMILES character-per-line vocabulary file.
The default vocab file is found in deepchem/feat/tests/data/vocab.txt
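A minimal usage sketch; the vocabulary path below points at the test vocabulary shipped in the DeepChem source tree and is illustrative:

```python
from deepchem.feat.smiles_tokenizer import SmilesTokenizer

# Load a SMILES-token-per-line vocabulary and encode aspirin.
tokenizer = SmilesTokenizer("deepchem/feat/tests/data/vocab.txt")
print(tokenizer.encode("CC(=O)OC1=CC=CC=C1C(=O)O"))
```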
The dc.feat.BasicSmilesTokenizer module uses a regex tokenization pattern to tokenize SMILES strings.
The regex was developed by Schwaller et al. The tokenizer is to be used on SMILES in cases
where the user wishes not to rely on the transformers API.
Run basic SMILES tokenization using a regex pattern developed by Schwaller et al.
This tokenizer is to be used when a tokenizer that does not require the transformers library by HuggingFace is required.
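A minimal usage sketch:

```python
from deepchem.feat.smiles_tokenizer import BasicSmilesTokenizer

# Split a SMILES string into atom-level tokens with the Schwaller regex.
tokenizer = BasicSmilesTokenizer()
print(tokenizer.tokenize("CC(=O)OC1=CC=CC=C1C(=O)O"))
```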
Wrapper class that wraps HuggingFace tokenizers as DeepChem featurizers
The HuggingFaceFeaturizer wrapper provides a wrapper
around Hugging Face tokenizers allowing them to be used as DeepChem
featurizers. This might be useful in scenarios where a user needs to use
a Hugging Face tokenizer when loading a dataset.
datapoints (Iterable[Any]) – A sequence of objects that you’d like to featurize. Subclasses of
Featurizer should implement the _featurize method that featurizes
objects in the sequence.
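A minimal usage sketch; the pretrained tokenizer name below is one example of a SMILES tokenizer hosted on the HuggingFace hub:

```python
from transformers import RobertaTokenizerFast
from deepchem.feat import HuggingFaceFeaturizer

# Wrap a pretrained SMILES tokenizer as a DeepChem featurizer.
hf_tokenizer = RobertaTokenizerFast.from_pretrained("seyonec/PubChem10M_SMILES_BPE_60k")
featurizer = HuggingFaceFeaturizer(tokenizer=hf_tokenizer)
feats = featurizer.featurize(["CC(=O)C"])
```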
Tokenizers use a vocabulary to tokenize datapoints. To build a vocabulary, an algorithm which generates a vocabulary from a corpus is required. A corpus is usually a collection of molecules, DNA sequences, etc. DeepChem provides the following algorithms to build a vocabulary from a corpus. A vocabulary builder is not a featurizer; it is a utility which helps the tokenizers featurize datapoints.
This module can be used to generate an atom vocabulary from SMILES strings for
the GROVER pretraining task. For each atom in a molecule, the vocabulary context is the
node-edge-count of the atom, where node is the neighboring atom, edge is the type of bond (single
bond or double bond) and count is the number of such node-edge pairs for the atom in its
neighborhood. For example, for the molecule ‘CC(=O)C’, the context of the first carbon atom is
C-SINGLE1 because its neighbor is a C atom, the type of bond is a SINGLE bond and the count of such
bonds is 1. The context of the second carbon atom is C-SINGLE2 and O-DOUBLE1 because
it is connected to two carbon atoms by single bonds and one O atom by a double bond.
The vocabulary of an atom is then computed as the atom-symbol_contexts, where the contexts
are sorted in alphabetical order when there are multiple contexts. For example, the
vocabulary of the second C is C_C-SINGLE2_O-DOUBLE1.
The algorithm enumerates the vocabulary of all atoms in the dataset and makes a vocabulary-to-index
mapping by sorting the vocabulary by frequency and then alphabetically. The max_size
parameter can be used for setting the size of the vocabulary. When this parameter is set,
the algorithm stops adding new words to the index when the vocabulary size reaches max_size.
Parameters:
max_size (int (optional)) – Maximum size of vocabulary
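A minimal sketch, assuming the build(dataset)/stoi interface found in recent DeepChem versions:

```python
import numpy as np
import deepchem as dc
from deepchem.feat.vocabulary_builders import GroverAtomVocabularyBuilder

# Build an atom vocabulary from a tiny SMILES dataset (interface assumed
# from recent DeepChem versions).
dataset = dc.data.NumpyDataset(X=np.array(["CC(=O)C", "CCC"]))
vocab = GroverAtomVocabularyBuilder()
vocab.build(dataset)
print(vocab.stoi)  # vocabulary-to-index mapping
```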
The dc.feat.PFMFeaturizer module implements a featurizer for position frequency matrices.
It takes in a list of multiple sequence alignments and returns a list of position frequency matrices.
Encodes a list of position frequency matrices for a given list of multiple sequence alignments.
The default character set is 25 amino acids. If you want to use a different character set, such as nucleotides, simply pass in
a list of character strings in the featurizer constructor.
The max_length parameter is the maximum length of the sequences to be featurized. If you want to featurize longer sequences, modify the
max_length parameter in the featurizer constructor.
The final row in the position frequency matrix is the unknown set, if there are any characters which are not included in the charset.
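A minimal usage sketch (the toy alignments are arbitrary):

```python
import deepchem as dc

# Each inner list is one multiple sequence alignment.
msa = [["ABC", "BCD"], ["AAA", "AAB"]]
featurizer = dc.feat.PFMFeaturizer()
pfms = featurizer.featurize(msa)
```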
Bert Featurizer.
The Bert Featurizer is a wrapper class for HuggingFace’s BertTokenizerFast.
This class is intended to allow users to use the BertTokenizer API while
remaining inside the DeepChem ecosystem.
Examples
>>> from deepchem.feat import BertFeaturizer
>>> from transformers import BertTokenizerFast
>>> tokenizer = BertTokenizerFast.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
>>> featurizer = BertFeaturizer(tokenizer)
>>> feats = featurizer.featurize(['D L I P [MASK] L V T'])
Notes
Examples are based on RostLab’s ProtBert documentation.
datapoints (Iterable[Any]) – A sequence of objects that you’d like to featurize. Subclasses of
Featurizer should implement the _featurize method that featurizes
objects in the sequence.
The Roberta Featurizer is a wrapper class around the Roberta Tokenizer,
which is used by Huggingface’s transformers library for tokenizing large corpora for Roberta models.
Please confirm the details in [1]_.
This class requires transformers to be installed.
RobertaFeaturizer uses dual inheritance with RobertaTokenizerFast in Huggingface for rapid tokenization,
as well as DeepChem’s MolecularFeaturizer class.
Add a dictionary of special tokens (eos, pad, cls, etc.) to the encoder and link them to class attributes. If
special tokens are NOT in the vocabulary, they are added to it (indexed starting from the last index of the
current vocabulary).
When adding new tokens to the vocabulary, you should make sure to also resize the token embedding matrix of the
model so that its embedding matrix matches the tokenizer.
In order to do that, please use the [~PreTrainedModel.resize_token_embeddings] method.
Using add_special_tokens will ensure your special tokens can be used in several ways:
Special tokens can be skipped when decoding using skip_special_tokens = True.
Special tokens are carefully handled by the tokenizer (they are never split), similar to AddedTokens.
You can easily refer to special tokens using tokenizer class attributes like tokenizer.cls_token. This
makes it easy to develop model-agnostic training and fine-tuning scripts.
When possible, special tokens are already registered for provided pretrained models (for instance
[BertTokenizer] cls_token is already registered to be ‘[CLS]’ and XLM’s one is also registered to be
‘</s>’).
Parameters:
special_tokens_dict (dictionary str to str or tokenizers.AddedToken) –
Keys should be in the list of predefined special attributes: [bos_token, eos_token, unk_token,
sep_token, pad_token, cls_token, mask_token, additional_special_tokens].
Tokens are only added if they are not already in the vocabulary (tested by checking if the tokenizer
assigns the index of the unk_token to them).
replace_additional_special_tokens (bool, optional, defaults to True) – If True, the existing list of additional special tokens will be replaced by the list provided in
special_tokens_dict. Otherwise, self._additional_special_tokens is just extended. In the former
case, the tokens will NOT be removed from the tokenizer’s full vocabulary - they are only being flagged
as non-special tokens. Remember, this only affects which tokens are skipped during decoding, not the
added_tokens_encoder and added_tokens_decoder. This means that the previous
additional_special_tokens are still added tokens, and will not be split by the model.
Returns:
Number of tokens added to the vocabulary.
Return type:
int
Examples:
```python
from transformers import GPT2Model, GPT2Tokenizer

# Let's see how to add a new classification token to GPT-2
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

special_tokens_dict = {"cls_token": "<CLS>"}

num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
print("We have added", num_added_toks, "tokens")
# Notice: resize_token_embeddings expects to receive the full size of the new vocabulary, i.e., the length of the tokenizer.
model.resize_token_embeddings(len(tokenizer))
```
Add a list of new tokens to the tokenizer class. If the new tokens are not in the vocabulary, they are added to
it with indices starting from the length of the current vocabulary and will be isolated before the tokenization
algorithm is applied. Added tokens and tokens from the vocabulary of the tokenization algorithm are therefore
not treated in the same way.
Note, when adding new tokens to the vocabulary, you should make sure to also resize the token embedding matrix
of the model so that its embedding matrix matches the tokenizer.
In order to do that, please use the [~PreTrainedModel.resize_token_embeddings] method.
Parameters:
new_tokens (str, tokenizers.AddedToken or a list of str or tokenizers.AddedToken) – Tokens are only added if they are not already in the vocabulary. tokenizers.AddedToken wraps a string
token to let you personalize its behavior: whether this token should only match against a single word,
whether this token should strip all potential whitespaces on the left side, whether this token should
strip all potential whitespaces on the right side, etc.
special_tokens (bool, optional, defaults to False) –
Can be used to specify if the token is a special token. This mostly changes the normalization behavior
(special tokens like CLS or [MASK] are usually not lower-cased, for instance).
See details for tokenizers.AddedToken in HuggingFace tokenizers library.
Returns:
Number of tokens added to the vocabulary.
Return type:
int
Examples:
```python
from transformers import BertModel, BertTokenizerFast

# Let's see how to increase the vocabulary of the BERT model and tokenizer
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

num_added_toks = tokenizer.add_tokens(["new_tok1", "my_new-tok2"])
print("We have added", num_added_toks, "tokens")
# Notice: resize_token_embeddings expects to receive the full size of the new vocabulary, i.e., the length of the tokenizer.
model.resize_token_embeddings(len(tokenizer))
```
Returns the sorted mapping from string to index. The added tokens encoder is cached for performance
optimisation in self._added_tokens_encoder for the slow tokenizers.
All the special tokens (‘<unk>’, ‘<cls>’, etc.); the order has
nothing to do with the index of each token. If you want to know the correct indices, check
self.added_tokens_encoder. We can’t create an order anymore, as the keys are AddedTokens and not strings.
Don’t convert tokens of tokenizers.AddedToken type to string so they can be used to control more finely how
special tokens are tokenized.
Converts a Conversation object or a list of dictionaries with “role” and “content” keys to a list of token
ids. This method is intended for use with chat models, and will read the tokenizer’s chat_template attribute to
determine the format and control tokens to use when converting. When chat_template is None, it will fall back
to the default_chat_template specified at the class level.
Parameters:
conversation (Union[List[Dict[str, str]], "Conversation"]) – A Conversation object or list of dicts
with “role” and “content” keys, representing the chat history so far.
chat_template (str, optional) – A Jinja template to use for this conversion. If
this is not passed, the model’s default chat template will be used instead.
add_generation_prompt (bool, optional) – Whether to end the prompt with the token(s) that indicate
the start of an assistant message. This is useful when you want to generate a response from the model.
Note that this argument will be passed to the chat template, and so it must be supported in the
template for this argument to have any effect.
tokenize (bool, defaults to True) – Whether to tokenize the output. If False, the output will be a string.
padding (bool, defaults to False) – Whether to pad sequences to the maximum length. Has no effect if tokenize is False.
truncation (bool, defaults to False) – Whether to truncate sequences at the maximum length. Has no effect if tokenize is False.
max_length (int, optional) – Maximum length (in tokens) to use for padding or truncation. Has no effect if tokenize is False. If
not specified, the tokenizer’s max_length attribute will be used as a default.
return_tensors (str or [~utils.TensorType], optional) – If set, will return tensors of a particular framework. Has no effect if tokenize is False. Acceptable
values are:
- ‘tf’: Return TensorFlow tf.Tensor objects.
- ‘pt’: Return PyTorch torch.Tensor objects.
- ‘np’: Return NumPy np.ndarray objects.
- ‘jax’: Return JAX jnp.ndarray objects.
**tokenizer_kwargs – Additional kwargs to pass to the tokenizer.
Returns:
A list of token ids representing the tokenized chat so far, including control tokens. This
output is ready to pass to the model, either directly or via methods like generate().
Temporarily sets the tokenizer for encoding the targets. Useful for tokenizer associated to
sequence-to-sequence models that need a slightly different processing for the labels.
Convert a list of lists of token ids into a list of strings by calling decode.
Parameters:
sequences (Union[List[int], List[List[int]], np.ndarray, torch.Tensor, tf.Tensor]) – List of tokenized input ids. Can be obtained using the __call__ method.
skip_special_tokens (bool, optional, defaults to False) – Whether or not to remove special tokens in the decoding.
clean_up_tokenization_spaces (bool, optional) – Whether or not to clean up the tokenization spaces. If None, will default to
self.clean_up_tokenization_spaces.
kwargs (additional keyword arguments, optional) – Will be passed to the underlying model specific decode method.
Tokenize and prepare for the model a list of sequences or a list of pairs of sequences.
Note: this method is deprecated; __call__ should be used instead.
Parameters:
batch_text_or_text_pairs (List[str], List[Tuple[str, str]], List[List[str]], List[Tuple[List[str], List[str]]], and for not-fast tokenizers, also List[List[int]], List[Tuple[List[int], List[int]]]) – Batch of sequences or pair of sequences to be encoded. This can be a list of
string/string-sequences/int-sequences or a list of pair of string/string-sequences/int-sequence (see
details in encode_plus).
add_special_tokens (bool, optional, defaults to True) – Whether or not to add special tokens when encoding the sequences. This will use the underlying
PretrainedTokenizerBase.build_inputs_with_special_tokens function, which defines which tokens are
automatically added to the input ids. This is useful if you want to add bos or eos tokens
automatically.
padding (bool, str or [~utils.PaddingStrategy], optional, defaults to False) –
Activates and controls padding. Accepts the following values:
True or ‘longest’: Pad to the longest sequence in the batch (or no padding if only a single
sequence is provided).
’max_length’: Pad to a maximum length specified with the argument max_length or to the maximum
acceptable input length for the model if that argument is not provided.
False or ‘do_not_pad’ (default): No padding (i.e., can output a batch with sequences of different
lengths).
truncation (bool, str or [~tokenization_utils_base.TruncationStrategy], optional, defaults to False) –
Activates and controls truncation. Accepts the following values:
True or ‘longest_first’: Truncate to a maximum length specified with the argument max_length or
to the maximum acceptable input length for the model if that argument is not provided. This will
truncate token by token, removing a token from the longest sequence in the pair if a pair of
sequences (or a batch of pairs) is provided.
’only_first’: Truncate to a maximum length specified with the argument max_length or to the
maximum acceptable input length for the model if that argument is not provided. This will only
truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
’only_second’: Truncate to a maximum length specified with the argument max_length or to the
maximum acceptable input length for the model if that argument is not provided. This will only
truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
False or ‘do_not_truncate’ (default): No truncation (i.e., can output batch with sequence lengths
greater than the model maximum admissible input size).
max_length (int, optional) –
Controls the maximum length to use by one of the truncation/padding parameters.
If left unset or set to None, this will use the predefined model maximum length if a maximum length
is required by one of the truncation/padding parameters. If the model has no specific maximum input
length (like XLNet) truncation/padding to a maximum length will be deactivated.
stride (int, optional, defaults to 0) – If set to a number along with max_length, the overflowing tokens returned when
return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence
returned to provide some overlap between truncated and overflowing sequences. The value of this
argument defines the number of overlapping tokens.
is_split_into_words (bool, optional, defaults to False) – Whether or not the input is already pre-tokenized (e.g., split into words). If set to True, the
tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace)
which it will tokenize. This is useful for NER or token classification.
pad_to_multiple_of (int, optional) – If set will pad the sequence to a multiple of the provided value. Requires padding to be activated.
This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
>= 7.5 (Volta).
return_tensors (str or [~utils.TensorType], optional) –
If set, will return tensors instead of list of python integers. Acceptable values are:
’tf’: Return TensorFlow tf.constant objects.
’pt’: Return PyTorch torch.Tensor objects.
’np’: Return Numpy np.ndarray objects.
return_token_type_ids (bool, optional) –
Whether to return token type IDs. If left to the default, will return the token type IDs according to
the specific tokenizer’s default, defined by the return_outputs attribute.
[What are token type IDs?](../glossary#token-type-ids)
return_attention_mask (bool, optional) –
Whether to return the attention mask. If left to the default, will return the attention mask according
to the specific tokenizer’s default, defined by the return_outputs attribute.
[What are attention masks?](../glossary#attention-mask)
return_overflowing_tokens (bool, optional, defaults to False) – Whether or not to return overflowing token sequences. If a pair of sequences of input ids (or a batch
of pairs) is provided with truncation_strategy = longest_first or True, an error is raised instead
of returning overflowing tokens.
return_special_tokens_mask (bool, optional, defaults to False) – Whether or not to return special tokens mask information.
return_offsets_mapping (bool, optional, defaults to False) –
Whether or not to return (char_start, char_end) for each token.
This is only available on fast tokenizers inheriting from [PreTrainedTokenizerFast], if using
Python’s tokenizer, this method will raise NotImplementedError.
return_length (bool, optional, defaults to False) – Whether or not to return the lengths of the encoded inputs.
verbose (bool, optional, defaults to True) – Whether or not to print more information and warnings.
**kwargs – passed to the self.tokenize() method
Returns:
A [BatchEncoding] with the following fields:
input_ids – List of token ids to be fed to a model.
[What are input IDs?](../glossary#input-ids)
token_type_ids – List of token type ids to be fed to a model (when return_token_type_ids=True or
if “token_type_ids” is in self.model_input_names).
[What are token type IDs?](../glossary#token-type-ids)
attention_mask – List of indices specifying which tokens should be attended to by the model (when
return_attention_mask=True or if “attention_mask” is in self.model_input_names).
[What are attention masks?](../glossary#attention-mask)
overflowing_tokens – List of overflowing tokens sequences (when a max_length is specified and
return_overflowing_tokens=True).
num_truncated_tokens – Number of tokens truncated (when a max_length is specified and
return_overflowing_tokens=True).
special_tokens_mask – List of 0s and 1s, with 1 specifying added special tokens and 0 specifying
regular sequence tokens (when add_special_tokens=True and return_special_tokens_mask=True).
length – The length of the inputs (when return_length=True)
Whether or not the slow tokenizer can be saved. Usually for sentencepiece based slow tokenizer, this
can only be True if the original “sentencepiece.model” was not deleted.
Classification token, to extract a summary of an input sequence leveraging self-attention along the full
depth of the model. Log an error if used while not having been set.
Id of the classification token in the vocabulary, to extract a summary of an input sequence
leveraging self-attention along the full depth of the model.
Converts a sequence of tokens into a single string. The simplest way to do it is “ “.join(tokens), but we
often want to remove sub-word tokenization artifacts at the same time.
Parameters:
tokens (List[str]) – The tokens to join into a string.
Create a mask from the two sequences passed to be used in a sequence-pair classification task. RoBERTa does not
make use of token type ids, therefore a list of zeros is returned.
Parameters:
token_ids_0 (List[int]) – List of IDs.
token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.
Converts a sequence of ids into a string, using the tokenizer and vocabulary with options to remove special
tokens and clean up tokenization spaces.
Similar to doing self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids)).
Parameters:
token_ids (Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]) – List of tokenized input ids. Can be obtained using the __call__ method.
skip_special_tokens (bool, optional, defaults to False) – Whether or not to remove special tokens in the decoding.
clean_up_tokenization_spaces (bool, optional) – Whether or not to clean up the tokenization spaces. If None, will default to
self.clean_up_tokenization_spaces.
kwargs (additional keyword arguments, optional) – Will be passed to the underlying model specific decode method.
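A brief sketch of encode followed by decode (assuming bert-base-uncased):
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
ids = tokenizer.encode("Hello world")                   # special tokens added by default
print(tokenizer.decode(ids))                            # '[CLS] hello world [SEP]'
print(tokenizer.decode(ids, skip_special_tokens=True))  # 'hello world'
```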
Converts a string to a sequence of ids (integer), using the tokenizer and vocabulary.
Same as doing self.convert_tokens_to_ids(self.tokenize(text)).
Parameters:
text (str, List[str] or List[int]) – The first sequence to be encoded. This can be a string, a list of strings (tokenized string using the
tokenize method) or a list of integers (tokenized string ids using the convert_tokens_to_ids
method).
text_pair (str, List[str] or List[int], optional) – Optional second sequence to be encoded. This can be a string, a list of strings (tokenized string using
the tokenize method) or a list of integers (tokenized string ids using the convert_tokens_to_ids
method).
add_special_tokens (bool, optional, defaults to True) – Whether or not to add special tokens when encoding the sequences. This will use the underlying
PretrainedTokenizerBase.build_inputs_with_special_tokens function, which defines which tokens are
automatically added to the input ids. This is useful if you want to add bos or eos tokens
automatically.
padding (bool, str or [~utils.PaddingStrategy], optional, defaults to False) –
Activates and controls padding. Accepts the following values:
True or ‘longest’: Pad to the longest sequence in the batch (or no padding if only a single
sequence is provided).
’max_length’: Pad to a maximum length specified with the argument max_length or to the maximum
acceptable input length for the model if that argument is not provided.
False or ‘do_not_pad’ (default): No padding (i.e., can output a batch with sequences of different
lengths).
truncation (bool, str or [~tokenization_utils_base.TruncationStrategy], optional, defaults to False) –
Activates and controls truncation. Accepts the following values:
True or ‘longest_first’: Truncate to a maximum length specified with the argument max_length or
to the maximum acceptable input length for the model if that argument is not provided. This will
truncate token by token, removing a token from the longest sequence in the pair if a pair of
sequences (or a batch of pairs) is provided.
’only_first’: Truncate to a maximum length specified with the argument max_length or to the
maximum acceptable input length for the model if that argument is not provided. This will only
truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
’only_second’: Truncate to a maximum length specified with the argument max_length or to the
maximum acceptable input length for the model if that argument is not provided. This will only
truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
False or ‘do_not_truncate’ (default): No truncation (i.e., can output batch with sequence lengths
greater than the model maximum admissible input size).
max_length (int, optional) –
Controls the maximum length to use by one of the truncation/padding parameters.
If left unset or set to None, this will use the predefined model maximum length if a maximum length
is required by one of the truncation/padding parameters. If the model has no specific maximum input
length (like XLNet) truncation/padding to a maximum length will be deactivated.
stride (int, optional, defaults to 0) – If set to a number along with max_length, the overflowing tokens returned when
return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence
returned to provide some overlap between truncated and overflowing sequences. The value of this
argument defines the number of overlapping tokens.
is_split_into_words (bool, optional, defaults to False) – Whether or not the input is already pre-tokenized (e.g., split into words). If set to True, the
tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace)
which it will tokenize. This is useful for NER or token classification.
pad_to_multiple_of (int, optional) – If set will pad the sequence to a multiple of the provided value. Requires padding to be activated.
This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
>= 7.5 (Volta).
return_tensors (str or [~utils.TensorType], optional) –
If set, will return tensors instead of list of python integers. Acceptable values are:
’tf’: Return TensorFlow tf.constant objects.
’pt’: Return PyTorch torch.Tensor objects.
’np’: Return Numpy np.ndarray objects.
**kwargs – Passed along to the .tokenize() method.
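A minimal sketch of encode, which returns a plain list of ids rather than a [BatchEncoding] (the ids shown are those of bert-base-uncased):
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
ids = tokenizer.encode("Hello world")
print(ids)  # [101, 7592, 2088, 102] -- [CLS] hello world [SEP]
```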
Tokenize and prepare for the model a sequence or a pair of sequences.
Warning: this method is deprecated; __call__ should be used instead.
Parameters:
text (str, List[str] or List[int] (the latter only for not-fast tokenizers)) – The first sequence to be encoded. This can be a string, a list of strings (tokenized string using the
tokenize method) or a list of integers (tokenized string ids using the convert_tokens_to_ids
method).
text_pair (str, List[str] or List[int], optional) – Optional second sequence to be encoded. This can be a string, a list of strings (tokenized string using
the tokenize method) or a list of integers (tokenized string ids using the convert_tokens_to_ids
method).
add_special_tokens (bool, optional, defaults to True) – Whether or not to add special tokens when encoding the sequences. This will use the underlying
PretrainedTokenizerBase.build_inputs_with_special_tokens function, which defines which tokens are
automatically added to the input ids. This is useful if you want to add bos or eos tokens
automatically.
padding (bool, str or [~utils.PaddingStrategy], optional, defaults to False) –
Activates and controls padding. Accepts the following values:
True or ‘longest’: Pad to the longest sequence in the batch (or no padding if only a single
sequence is provided).
’max_length’: Pad to a maximum length specified with the argument max_length or to the maximum
acceptable input length for the model if that argument is not provided.
False or ‘do_not_pad’ (default): No padding (i.e., can output a batch with sequences of different
lengths).
truncation (bool, str or [~tokenization_utils_base.TruncationStrategy], optional, defaults to False) –
Activates and controls truncation. Accepts the following values:
True or ‘longest_first’: Truncate to a maximum length specified with the argument max_length or
to the maximum acceptable input length for the model if that argument is not provided. This will
truncate token by token, removing a token from the longest sequence in the pair if a pair of
sequences (or a batch of pairs) is provided.
’only_first’: Truncate to a maximum length specified with the argument max_length or to the
maximum acceptable input length for the model if that argument is not provided. This will only
truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
’only_second’: Truncate to a maximum length specified with the argument max_length or to the
maximum acceptable input length for the model if that argument is not provided. This will only
truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
False or ‘do_not_truncate’ (default): No truncation (i.e., can output batch with sequence lengths
greater than the model maximum admissible input size).
max_length (int, optional) –
Controls the maximum length to use by one of the truncation/padding parameters.
If left unset or set to None, this will use the predefined model maximum length if a maximum length
is required by one of the truncation/padding parameters. If the model has no specific maximum input
length (like XLNet) truncation/padding to a maximum length will be deactivated.
stride (int, optional, defaults to 0) – If set to a number along with max_length, the overflowing tokens returned when
return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence
returned to provide some overlap between truncated and overflowing sequences. The value of this
argument defines the number of overlapping tokens.
is_split_into_words (bool, optional, defaults to False) – Whether or not the input is already pre-tokenized (e.g., split into words). If set to True, the
tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace)
which it will tokenize. This is useful for NER or token classification.
pad_to_multiple_of (int, optional) – If set will pad the sequence to a multiple of the provided value. Requires padding to be activated.
This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
>= 7.5 (Volta).
return_tensors (str or [~utils.TensorType], optional) –
If set, will return tensors instead of list of python integers. Acceptable values are:
’tf’: Return TensorFlow tf.constant objects.
’pt’: Return PyTorch torch.Tensor objects.
’np’: Return Numpy np.ndarray objects.
return_token_type_ids (bool, optional) –
Whether to return token type IDs. If left to the default, will return the token type IDs according to
the specific tokenizer’s default, defined by the return_outputs attribute.
[What are token type IDs?](../glossary#token-type-ids)
return_attention_mask (bool, optional) –
Whether to return the attention mask. If left to the default, will return the attention mask according
to the specific tokenizer’s default, defined by the return_outputs attribute.
[What are attention masks?](../glossary#attention-mask)
return_overflowing_tokens (bool, optional, defaults to False) – Whether or not to return overflowing token sequences. If a pair of sequences of input ids (or a batch
of pairs) is provided with truncation_strategy = longest_first or True, an error is raised instead
of returning overflowing tokens.
return_special_tokens_mask (bool, optional, defaults to False) – Whether or not to return special tokens mask information.
return_offsets_mapping (bool, optional, defaults to False) –
Whether or not to return (char_start, char_end) for each token.
This is only available on fast tokenizers inheriting from [PreTrainedTokenizerFast]; if using
Python’s tokenizer, this method will raise NotImplementedError.
return_length (bool, optional, defaults to False) – Whether or not to return the lengths of the encoded inputs.
verbose (bool, optional, defaults to True) – Whether or not to print more information and warnings.
**kwargs – Passed along to the self.tokenize() method.
Returns:
A [BatchEncoding] with the following fields:
input_ids – List of token ids to be fed to a model.
[What are input IDs?](../glossary#input-ids)
token_type_ids – List of token type ids to be fed to a model (when return_token_type_ids=True or
if “token_type_ids” is in self.model_input_names).
[What are token type IDs?](../glossary#token-type-ids)
attention_mask – List of indices specifying which tokens should be attended to by the model (when
return_attention_mask=True or if “attention_mask” is in self.model_input_names).
[What are attention masks?](../glossary#attention-mask)
overflowing_tokens – List of overflowing tokens sequences (when a max_length is specified and
return_overflowing_tokens=True).
num_truncated_tokens – Number of tokens truncated (when a max_length is specified and
return_overflowing_tokens=True).
special_tokens_mask – List of 0s and 1s, with 1 specifying added special tokens and 0 specifying
regular sequence tokens (when add_special_tokens=True and return_special_tokens_mask=True).
length – The length of the inputs (when return_length=True)
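Since this method is deprecated, a short sketch of the equivalent __call__ usage (assuming bert-base-uncased):
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("Hello world", padding="max_length", max_length=8, truncation=True)
print(enc["input_ids"])       # padded out to length 8
print(enc["attention_mask"])  # 1s for real tokens, 0s for the padding
```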
datapoints (Iterable[Any]) – A sequence of objects that you’d like to featurize. Subclasses of
Featurizer should implement the _featurize method, which featurizes
a single object in the sequence.
Instantiate a [~tokenization_utils_base.PreTrainedTokenizerBase] (or a derived class) from a predefined
tokenizer.
Parameters:
pretrained_model_name_or_path (str or os.PathLike) –
Can be either:
A string, the model id of a predefined tokenizer hosted inside a model repo on huggingface.co.
Valid model ids can be located at the root-level, like bert-base-uncased, or namespaced under a
user or organization name, like dbmdz/bert-base-german-cased.
A path to a directory containing vocabulary files required by the tokenizer, for instance saved
using the [~tokenization_utils_base.PreTrainedTokenizerBase.save_pretrained] method, e.g.,
./my_model_directory/.
(Deprecated, not applicable to all derived classes) A path or url to a single saved vocabulary
file (if and only if the tokenizer only requires a single vocabulary file like Bert or XLNet), e.g.,
./my_model_directory/vocab.txt.
cache_dir (str or os.PathLike, optional) – Path to a directory in which the downloaded predefined tokenizer vocabulary files should be cached if the
standard cache should not be used.
force_download (bool, optional, defaults to False) – Whether or not to force (re-)downloading the vocabulary files and override the cached versions if they
exist.
resume_download (bool, optional, defaults to False) – Whether or not to delete incompletely received files. Will attempt to resume the download if such a file
exists.
proxies (Dict[str, str], optional) – A dictionary of proxy servers to use by protocol or endpoint, e.g., {'http': 'foo.bar:3128',
'http://hostname': 'foo.bar:4012'}. The proxies are used on each request.
token (str or bool, optional) – The token to use as HTTP bearer authorization for remote files. If True, will use the token generated
when running huggingface-cli login (stored in ~/.huggingface).
local_files_only (bool, optional, defaults to False) – Whether or not to only rely on local files and not to attempt to download any files.
revision (str, optional, defaults to “main”) – The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a
git-based system for storing models and other artifacts on huggingface.co, so revision can be any
identifier allowed by git.
subfolder (str, optional) – In case the relevant files are located inside a subfolder of the model repo on huggingface.co (e.g. for
facebook/rag-token-base), specify it here.
inputs (additional positional arguments, optional) – Will be passed along to the Tokenizer __init__ method.
kwargs (additional keyword arguments, optional) – Will be passed to the Tokenizer __init__ method. Can be used to set special tokens like bos_token,
eos_token, unk_token, sep_token, pad_token, cls_token, mask_token,
additional_special_tokens. See parameters in the __init__ for more details.
<Tip>
Passing token=True is required when you want to use a private model.
</Tip>
Examples:
```python
# We can't instantiate directly the base class PreTrainedTokenizerBase, so let's
# show our examples on a derived class: BertTokenizer
from transformers import BertTokenizer

# Download vocabulary from huggingface.co and cache.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Download vocabulary from huggingface.co (user-uploaded) and cache.
tokenizer = BertTokenizer.from_pretrained("dbmdz/bert-base-german-cased")

# If vocabulary files are in a directory (e.g. tokenizer was saved using save_pretrained('./test/saved_model/'))
tokenizer = BertTokenizer.from_pretrained("./test/saved_model/")

# If the tokenizer uses a single vocabulary file, you can point directly to this file
tokenizer = BertTokenizer.from_pretrained("./test/saved_model/my_vocab.txt")

# You can link tokens to special vocabulary when instantiating
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", unk_token="<unk>")
# You should be sure '<unk>' is in the vocabulary when doing that.
# Otherwise use tokenizer.add_special_tokens({'unk_token': '<unk>'}) instead.
assert tokenizer.unk_token == "<unk>"
```
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer prepare_for_model or encode_plus methods.
Parameters:
token_ids_0 (List[int]) – List of ids of the first sequence.
token_ids_1 (List[int], optional) – List of ids of the second sequence.
already_has_special_tokens (bool, optional, defaults to False) – Whether or not the token list is already formatted with special tokens for the model.
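A minimal sketch (assuming bert-base-uncased; the 1s mark the added [CLS] and [SEP] tokens):
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
ids = tokenizer.encode("Hello world")  # includes special tokens
mask = tokenizer.get_special_tokens_mask(ids, already_has_special_tokens=True)
print(mask)  # [1, 0, 0, 1]
```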
Pad a single encoded input or a batch of encoded inputs up to predefined length or to the max sequence length
in the batch.
The padding side (left/right) and padding token ids are defined at the tokenizer level (with self.padding_side,
self.pad_token_id and self.pad_token_type_id).
Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the
text followed by a call to the pad method to get a padded encoding.
Note: if the encoded_inputs passed are dictionaries of numpy arrays, PyTorch tensors or TensorFlow tensors, the
result will use the same type unless you provide a different tensor type with return_tensors. In the case of
PyTorch tensors, you will however lose the specific device of your tensors.
Parameters:
encoded_inputs ([BatchEncoding], list of [BatchEncoding], Dict[str, List[int]], Dict[str, List[List[int]]] or List[Dict[str, List[int]]]) –
Tokenized inputs. Can represent one input ([BatchEncoding] or Dict[str, List[int]]) or a batch of
tokenized inputs (list of [BatchEncoding], Dict[str, List[List[int]]] or List[Dict[str,
List[int]]]) so you can use this method during preprocessing as well as in a PyTorch Dataloader
collate function.
Instead of List[int] you can have tensors (numpy arrays, PyTorch tensors or TensorFlow tensors), see
the note above for the return type.
padding (bool, str or [~utils.PaddingStrategy], optional, defaults to True) –
Select a strategy to pad the returned sequences (according to the model’s padding side and padding
index) among:
True or ‘longest’: Pad to the longest sequence in the batch (or no padding if only a single
sequence is provided).
’max_length’: Pad to a maximum length specified with the argument max_length or to the maximum
acceptable input length for the model if that argument is not provided.
False or ‘do_not_pad’ (default): No padding (i.e., can output a batch with sequences of different
lengths).
max_length (int, optional) – Maximum length of the returned list and optionally padding length (see above).
pad_to_multiple_of (int, optional) –
If set will pad the sequence to a multiple of the provided value.
This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
>= 7.5 (Volta).
return_attention_mask (bool, optional) –
Whether to return the attention mask. If left to the default, will return the attention mask according
to the specific tokenizer’s default, defined by the return_outputs attribute.
[What are attention masks?](../glossary#attention-mask)
return_tensors (str or [~utils.TensorType], optional) –
If set, will return tensors instead of list of python integers. Acceptable values are:
’tf’: Return TensorFlow tf.constant objects.
’pt’: Return PyTorch torch.Tensor objects.
’np’: Return Numpy np.ndarray objects.
verbose (bool, optional, defaults to True) – Whether or not to print more information and warnings.
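A short sketch of padding a batch after encoding (assuming bert-base-uncased):
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["Hello world", "Hi"])  # unpadded lists of ids
padded = tokenizer.pad(batch, padding="longest", return_tensors="np")
print(padded["input_ids"].shape)    # (2, length of the longest sequence)
print(padded["attention_mask"][1])  # trailing zeros mark the padding
```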
Prepares a sequence of input ids, or a pair of sequences of input ids, so that it can be used by the model. It
adds special tokens, truncates sequences if they overflow while taking the special tokens into account, and
manages a moving window (with a user-defined stride) for overflowing tokens. Note that for pair_ids
different from None and truncation_strategy = longest_first or True, it is not possible to return
overflowing tokens; such a combination of arguments will raise an error.
Parameters:
ids (List[int]) – Tokenized input ids of the first sequence. Can be obtained from a string by chaining the tokenize and
convert_tokens_to_ids methods.
pair_ids (List[int], optional) – Tokenized input ids of the second sequence. Can be obtained from a string by chaining the tokenize
and convert_tokens_to_ids methods.
add_special_tokens (bool, optional, defaults to True) – Whether or not to add special tokens when encoding the sequences. This will use the underlying
PretrainedTokenizerBase.build_inputs_with_special_tokens function, which defines which tokens are
automatically added to the input ids. This is useful if you want to add bos or eos tokens
automatically.
padding (bool, str or [~utils.PaddingStrategy], optional, defaults to False) –
Activates and controls padding. Accepts the following values:
True or ‘longest’: Pad to the longest sequence in the batch (or no padding if only a single
sequence is provided).
’max_length’: Pad to a maximum length specified with the argument max_length or to the maximum
acceptable input length for the model if that argument is not provided.
False or ‘do_not_pad’ (default): No padding (i.e., can output a batch with sequences of different
lengths).
truncation (bool, str or [~tokenization_utils_base.TruncationStrategy], optional, defaults to False) –
Activates and controls truncation. Accepts the following values:
True or ‘longest_first’: Truncate to a maximum length specified with the argument max_length or
to the maximum acceptable input length for the model if that argument is not provided. This will
truncate token by token, removing a token from the longest sequence in the pair if a pair of
sequences (or a batch of pairs) is provided.
’only_first’: Truncate to a maximum length specified with the argument max_length or to the
maximum acceptable input length for the model if that argument is not provided. This will only
truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
’only_second’: Truncate to a maximum length specified with the argument max_length or to the
maximum acceptable input length for the model if that argument is not provided. This will only
truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
False or ‘do_not_truncate’ (default): No truncation (i.e., can output batch with sequence lengths
greater than the model maximum admissible input size).
max_length (int, optional) –
Controls the maximum length to use by one of the truncation/padding parameters.
If left unset or set to None, this will use the predefined model maximum length if a maximum length
is required by one of the truncation/padding parameters. If the model has no specific maximum input
length (like XLNet) truncation/padding to a maximum length will be deactivated.
stride (int, optional, defaults to 0) – If set to a number along with max_length, the overflowing tokens returned when
return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence
returned to provide some overlap between truncated and overflowing sequences. The value of this
argument defines the number of overlapping tokens.
is_split_into_words (bool, optional, defaults to False) – Whether or not the input is already pre-tokenized (e.g., split into words). If set to True, the
tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace)
which it will tokenize. This is useful for NER or token classification.
pad_to_multiple_of (int, optional) – If set will pad the sequence to a multiple of the provided value. Requires padding to be activated.
This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
>= 7.5 (Volta).
return_tensors (str or [~utils.TensorType], optional) –
If set, will return tensors instead of list of python integers. Acceptable values are:
’tf’: Return TensorFlow tf.constant objects.
’pt’: Return PyTorch torch.Tensor objects.
’np’: Return Numpy np.ndarray objects.
return_token_type_ids (bool, optional) –
Whether to return token type IDs. If left to the default, will return the token type IDs according to
the specific tokenizer’s default, defined by the return_outputs attribute.
[What are token type IDs?](../glossary#token-type-ids)
return_attention_mask (bool, optional) –
Whether to return the attention mask. If left to the default, will return the attention mask according
to the specific tokenizer’s default, defined by the return_outputs attribute.
[What are attention masks?](../glossary#attention-mask)
return_overflowing_tokens (bool, optional, defaults to False) – Whether or not to return overflowing token sequences. If a pair of sequences of input ids (or a batch
of pairs) is provided with truncation_strategy = longest_first or True, an error is raised instead
of returning overflowing tokens.
return_special_tokens_mask (bool, optional, defaults to False) – Whether or not to return special tokens mask information.
return_offsets_mapping (bool, optional, defaults to False) –
Whether or not to return (char_start, char_end) for each token.
This is only available on fast tokenizers inheriting from [PreTrainedTokenizerFast]; if using
Python’s tokenizer, this method will raise NotImplementedError.
return_length (bool, optional, defaults to False) – Whether or not to return the lengths of the encoded inputs.
verbose (bool, optional, defaults to True) – Whether or not to print more information and warnings.
**kwargs – Passed along to the self.tokenize() method.
Returns:
A [BatchEncoding] with the following fields:
input_ids – List of token ids to be fed to a model.
[What are input IDs?](../glossary#input-ids)
token_type_ids – List of token type ids to be fed to a model (when return_token_type_ids=True or
if “token_type_ids” is in self.model_input_names).
[What are token type IDs?](../glossary#token-type-ids)
attention_mask – List of indices specifying which tokens should be attended to by the model (when
return_attention_mask=True or if “attention_mask” is in self.model_input_names).
[What are attention masks?](../glossary#attention-mask)
overflowing_tokens – List of overflowing tokens sequences (when a max_length is specified and
return_overflowing_tokens=True).
num_truncated_tokens – Number of tokens truncated (when a max_length is specified and
return_overflowing_tokens=True).
special_tokens_mask – List of 0s and 1s, with 1 specifying added special tokens and 0 specifying
regular sequence tokens (when add_special_tokens=True and return_special_tokens_mask=True).
length – The length of the inputs (when return_length=True)
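A minimal sketch of building model inputs from pre-converted ids (assuming bert-base-uncased):
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Hello world"))
enc = tokenizer.prepare_for_model(ids, add_special_tokens=True)
print(enc["input_ids"])  # the ids wrapped with the model's special tokens
```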
Prepare model inputs for translation. For best performance, translate one sentence at a time.
Parameters:
src_texts (List[str]) – List of documents to summarize or source language texts.
tgt_texts (list, optional) – List of summaries or target language texts.
max_length (int, optional) – Controls the maximum length for encoder inputs (documents to summarize or source language texts). If
left unset or set to None, this will use the predefined model maximum length if a maximum length is
required by one of the truncation/padding parameters. If the model has no specific maximum input length
(like XLNet), truncation/padding to a maximum length will be deactivated.
max_target_length (int, optional) – Controls the maximum length of decoder inputs (target language texts or summaries). If left unset or set
to None, this will use the max_length value.
padding (bool, str or [~utils.PaddingStrategy], optional, defaults to False) –
Activates and controls padding. Accepts the following values:
True or ‘longest’: Pad to the longest sequence in the batch (or no padding if only a single
sequence is provided).
’max_length’: Pad to a maximum length specified with the argument max_length or to the maximum
acceptable input length for the model if that argument is not provided.
False or ‘do_not_pad’ (default): No padding (i.e., can output a batch with sequences of different
lengths).
return_tensors (str or [~utils.TensorType], optional) –
If set, will return tensors instead of list of python integers. Acceptable values are:
’tf’: Return TensorFlow tf.constant objects.
’pt’: Return PyTorch torch.Tensor objects.
’np’: Return Numpy np.ndarray objects.
truncation (bool, str or [~tokenization_utils_base.TruncationStrategy], optional, defaults to True) –
Activates and controls truncation. Accepts the following values:
True or ‘longest_first’: Truncate to a maximum length specified with the argument max_length or
to the maximum acceptable input length for the model if that argument is not provided. This will
truncate token by token, removing a token from the longest sequence in the pair if a pair of
sequences (or a batch of pairs) is provided.
’only_first’: Truncate to a maximum length specified with the argument max_length or to the
maximum acceptable input length for the model if that argument is not provided. This will only
truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
’only_second’: Truncate to a maximum length specified with the argument max_length or to the
maximum acceptable input length for the model if that argument is not provided. This will only
truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
False or ‘do_not_truncate’ (default): No truncation (i.e., can output batch with sequence lengths
greater than the model maximum admissible input size).
**kwargs – Additional keyword arguments passed along to self.__call__.
Returns:
A [BatchEncoding] with the following fields:
input_ids – List of token ids to be fed to the encoder.
attention_mask – List of indices specifying which tokens should be attended to by the model.
labels – List of token ids for tgt_texts.
The full set of keys [input_ids, attention_mask, labels] will only be returned if tgt_texts is passed.
Otherwise, input_ids and attention_mask will be the only keys.
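A hedged sketch with a translation tokenizer (assuming the Helsinki-NLP/opus-mt-en-de checkpoint; recent transformers versions may emit a deprecation warning for this method and prefer calling the tokenizer with a text_target argument instead):
```python
from transformers import MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
batch = tokenizer.prepare_seq2seq_batch(
    src_texts=["Hello world"],
    tgt_texts=["Hallo Welt"],
    return_tensors="pt",
)
print(batch.keys())  # input_ids, attention_mask, labels
```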
repo_id (str) – The name of the repository you want to push your tokenizer to. It should contain your organization name
when pushing to a given organization.
use_temp_dir (bool, optional) – Whether or not to use a temporary directory to store the files saved before they are pushed to the Hub.
Will default to True if there is no directory named like repo_id, False otherwise.
commit_message (str, optional) – Message to commit while pushing. Will default to “Upload tokenizer”.
private (bool, optional) – Whether or not the repository created should be private.
token (bool or str, optional) – The token to use as HTTP bearer authorization for remote files. If True, will use the token generated
when running huggingface-cli login (stored in ~/.huggingface). Will default to True if repo_url
is not specified.
max_shard_size (int or str, optional, defaults to “5GB”) – Only applicable for models. The maximum size for a checkpoint before being sharded. Each checkpoint
shard will then be smaller than this size. If expressed as a string, it needs to be digits followed
by a unit (like “5MB”). We default it to “5GB” so that users can easily load models on free-tier
Google Colab instances without any CPU OOM issues.
create_pr (bool, optional, defaults to False) – Whether or not to create a PR with the uploaded files or directly commit.
safe_serialization (bool, optional, defaults to True) – Whether or not to convert the model weights to the safetensors format for safer serialization.
revision (str, optional) – Branch to push the uploaded files to.
commit_description (str, optional) – The description of the commit that will be created.
tags (List[str], optional) – List of tags to push on the Hub.
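A minimal sketch (assumes you have already authenticated with huggingface-cli login and that the hypothetical repo name is yours to create):
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Pushes the tokenizer files to a (hypothetical) repo under your namespace
tokenizer.push_to_hub("my-username/my-tokenizer", commit_message="Upload tokenizer")
```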
Register this class with a given auto class. This should only be used for custom tokenizers as the ones in the
library are already mapped with AutoTokenizer.
Warning: this API is experimental and may have some slight breaking changes in the next releases.
Parameters:
auto_class (str or type, optional, defaults to “AutoTokenizer”) – The auto class to register this new tokenizer with.
This method makes sure the full tokenizer can then be re-loaded using the
[~tokenization_utils_base.PreTrainedTokenizer.from_pretrained] class method.
Warning: this won’t save modifications you may have applied to the tokenizer after instantiation (for
instance, modifying tokenizer.do_lower_case after creation).
Parameters:
save_directory (str or os.PathLike) – The path to a directory where the tokenizer will be saved.
legacy_format (bool, optional) –
Only applicable for a fast tokenizer. If unset (default), will save the tokenizer in the unified JSON
format as well as in legacy format if it exists, i.e. with tokenizer-specific vocabulary and a separate
added_tokens file.
If False, will only save the tokenizer in the unified JSON format. This format is incompatible with
“slow” tokenizers (not powered by the tokenizers library), so the tokenizer will not be able to be
loaded in the corresponding “slow” tokenizer.
If True, will save the tokenizer in legacy format. If the “slow” tokenizer doesn’t exist, a ValueError
is raised.
filename_prefix (str, optional) – A prefix to add to the names of the files saved by the tokenizer.
push_to_hub (bool, optional, defaults to False) – Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the
repository you want to push to with repo_id (will default to the name of save_directory in your
namespace).
kwargs (Dict[str, Any], optional) – Additional key word arguments passed along to the [~utils.PushToHubMixin.push_to_hub] method.
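A short sketch of saving and reloading the full tokenizer state:
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.save_pretrained("./my_tokenizer/")
reloaded = BertTokenizer.from_pretrained("./my_tokenizer/")
```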
Save only the vocabulary of the tokenizer (vocabulary + added tokens).
This method won’t save the configuration and special token mappings of the tokenizer. Use
[~PreTrainedTokenizerFast._save_pretrained] to save the whole state of the tokenizer.
Parameters:
save_directory (str) – The directory in which to save the vocabulary.
filename_prefix (str, optional) – An optional prefix to add to the names of the saved files.
Define the truncation and the padding strategies for fast tokenizers (provided by HuggingFace tokenizers
library) and restore the tokenizer settings afterwards.
The provided tokenizer has no padding / truncation strategy before the managed section. If your tokenizer
had a padding / truncation strategy set before, it will be reset to no padding / truncation when exiting the
managed section.
Parameters:
padding_strategy ([~utils.PaddingStrategy]) – The kind of padding that will be applied to the input.
truncation_strategy ([~tokenization_utils_base.TruncationStrategy]) – The kind of truncation that will be applied to the input.
max_length (int) – The maximum size of a sequence.
stride (int) – The stride to use when handling overflow.
pad_to_multiple_of (int, optional) – If set will pad the sequence to a multiple of the provided value. This is especially useful to enable
the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
Converts a string into a sequence of tokens, replacing unknown tokens with the unk_token.
Parameters:
text (str) – The sequence to be encoded.
pair (str, optional) – A second sequence to be encoded with the first.
add_special_tokens (bool, optional, defaults to False) – Whether or not to add the special tokens associated with the corresponding model.
kwargs (additional keyword arguments, optional) – Will be passed to the underlying model specific encode method. See details in
[~PreTrainedTokenizerBase.__call__]
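A minimal sketch (assuming bert-base-uncased; the split is vocabulary-dependent):
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Hello world"))  # ['hello', 'world']
```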
Trains a tokenizer on a new corpus with the same defaults (in terms of special tokens or tokenization pipeline)
as the current one.
Parameters:
text_iterator (generator of List[str]) – The training corpus. Should be a generator of batches of texts, for instance a list of lists of texts
if you have everything in memory.
vocab_size (int) – The size of the vocabulary you want for your tokenizer.
length (int, optional) – The total number of sequences in the iterator. This is used to provide meaningful progress tracking.
new_special_tokens (list of str or AddedToken, optional) – A list of new special tokens to add to the tokenizer you are training.
special_tokens_map (Dict[str, str], optional) – If you want to rename some of the special tokens this tokenizer uses, pass along a mapping from old
special token name to new special token name in this argument.
kwargs (Dict[str, Any], optional) – Additional keyword arguments passed along to the trainer from the 🤗 Tokenizers library.
Returns:
A new tokenizer of the same type as the original one, trained on
text_iterator.
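A hedged sketch of retraining on a toy corpus (requires a fast tokenizer; the corpus and vocab_size are placeholders):
```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
corpus = [["the cat sat on the mat", "the dog barked"]]  # batches of texts
new_tokenizer = tokenizer.train_new_from_iterator(iter(corpus), vocab_size=1000)
```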
Truncates a sequence pair in-place following the strategy.
Parameters:
ids (List[int]) – Tokenized input ids of the first sequence. Can be obtained from a string by chaining the tokenize and
convert_tokens_to_ids methods.
pair_ids (List[int], optional) – Tokenized input ids of the second sequence. Can be obtained from a string by chaining the tokenize
and convert_tokens_to_ids methods.
num_tokens_to_remove (int, optional, defaults to 0) – Number of tokens to remove using the truncation strategy.
truncation_strategy (str or [~tokenization_utils_base.TruncationStrategy], optional, defaults to False) –
The strategy to follow for truncation. Can be:
’longest_first’: Truncate to a maximum length specified with the argument max_length or to the
maximum acceptable input length for the model if that argument is not provided. This will truncate
token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a
batch of pairs) is provided.
’only_first’: Truncate to a maximum length specified with the argument max_length or to the
maximum acceptable input length for the model if that argument is not provided. This will only
truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
’only_second’: Truncate to a maximum length specified with the argument max_length or to the
maximum acceptable input length for the model if that argument is not provided. This will only
truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
’do_not_truncate’ (default): No truncation (i.e., can output batch with sequence lengths greater
than the model maximum admissible input size).
stride (int, optional, defaults to 0) – If set to a positive number, the overflowing tokens returned will contain some tokens from the main
sequence returned. The value of this argument defines the number of additional tokens.
Returns:
The truncated ids, the truncated pair_ids and the list of
overflowing tokens. Note: the longest_first strategy returns an empty list of overflowing tokens if a pair
of sequences (or a batch of pairs) is provided.
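A minimal sketch (assuming bert-base-uncased):
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
ids = tokenizer.encode("a fairly long example sentence", add_special_tokens=False)
truncated, _, overflowing = tokenizer.truncate_sequences(
    ids, num_tokens_to_remove=2, truncation_strategy="longest_first"
)
print(len(ids) - len(truncated))  # 2
```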
RxnFeaturizer is a wrapper class around HuggingFace’s RobertaTokenizerFast,
intended for featurizing chemical reaction datasets. The featurizer
computes the source and target required for a seq2seq task and applies the
RobertaTokenizer to each separately. Additionally, it can either separate or
mix the reactants and reagents before tokenizing.
datapoints (Iterable[Any]) – A sequence of objects that you’d like to featurize. Subclasses of
Featurizer should implement the _featurize method, which featurizes
a single object in the sequence.
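A hedged sketch of the intended usage (the checkpoint name and reaction SMILES are illustrative; check the DeepChem API reference for the exact constructor arguments):
```python
import deepchem as dc
from transformers import RobertaTokenizerFast

# A hypothetical reaction-SMILES tokenizer checkpoint
tokenizer = RobertaTokenizerFast.from_pretrained("seyonec/PubChem10M_SMILES_BPE_450k")
featurizer = dc.feat.RxnFeaturizer(tokenizer, sep_reagent=True)
feats = featurizer.featurize(["CCS(=O)(=O)Cl.OCCBr>CCN(CC)CC.CCOCC>CCS(=O)(=O)OCCBr"])
```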
Featurizes binding pockets with information about chemical
environments.
In many applications, it’s desirable to look at binding pockets on
macromolecules which may be good targets for potential ligands or
other molecules to interact with. A BindingPocketFeaturizer
expects to be given a macromolecule, and a list of pockets to
featurize on that macromolecule. These pockets should be of the form
produced by a dc.dock.BindingPocketFinder, that is as a list of
dc.utils.CoordinateBox objects.
The base featurization in this class is currently
very simple: it counts the number of residues of each type present
in the pocket. You will likely want to override this
implementation for more sophisticated downstream use cases. Note that
this class’s implementation will only work for proteins, not for
other macromolecules.
datapoints (Iterable[Any]) – A sequence of objects that you’d like to featurize. Subclasses of
Featurizer should implement the _featurize method, which featurizes
a single object in the sequence.
Class that implements a no-op featurization.
This is useful when the raw dataset has to be used without featurizing the
examples. The MolNet loader requires a featurizer as input, and such datasets
can be used in their original form by passing this raw featurizer.
Abstract class for calculating a set of features for a datapoint.
This class is abstract and cannot be invoked directly. You’ll
likely only interact with this class if you’re a developer. In
that case, you might want to make a child class which
implements the _featurize method for calculating features for
a single datapoint, if you’d like to make a featurizer for a
new datatype.
datapoints (Iterable[Any]) – A sequence of objects that you’d like to featurize. Subclasses of
Featurizer should implement the _featurize method, which featurizes
a single object in the sequence.
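For developers, a hypothetical minimal subclass (the LengthFeaturizer name and its length feature are purely illustrative):
```python
import numpy as np
from deepchem.feat import Featurizer

class LengthFeaturizer(Featurizer):
    """Hypothetical featurizer: represents each datapoint by its length."""

    def _featurize(self, datapoint, **kwargs):
        return np.array([len(datapoint)])

feats = LengthFeaturizer().featurize(["C", "CCC"])  # array of shape (2, 1)
```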
If you’re creating a new featurizer that featurizes molecules,
you will want to inherit from the abstract MolecularFeaturizer base class.
This featurizer can take RDKit mol objects or SMILES as inputs.
Abstract class for calculating a set of features for a
molecule.
The defining feature of a MolecularFeaturizer is that it
uses SMILES strings and RDKit molecule objects to represent
small molecules. All other featurizers which are subclasses of
this class should plan to process input which comes as SMILES
strings or RDKit molecules.
Child classes need to implement the _featurize method for
calculating features for a single molecule.
The subclasses of this class require RDKit to be installed.
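A hypothetical sketch of such a subclass (the heavy-atom count feature is illustrative only):
```python
import numpy as np
from deepchem.feat import MolecularFeaturizer

class HeavyAtomCountFeaturizer(MolecularFeaturizer):
    """Hypothetical featurizer: one feature, the heavy-atom count."""

    def _featurize(self, datapoint, **kwargs):
        # datapoint arrives as an RDKit Mol; SMILES inputs are converted upstream
        return np.array([datapoint.GetNumHeavyAtoms()])

feats = HeavyAtomCountFeaturizer().featurize(["C", "CCC"])  # [[1], [3]]
```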
If you’re creating a new featurizer that featurizes compositional formulas,
you will want to inherit from the abstract MaterialCompositionFeaturizer base class.
Abstract class for calculating a set of features for an
inorganic crystal composition.
The defining feature of a MaterialCompositionFeaturizer is that it
operates on 3D crystal chemical compositions.
Inorganic crystal compositions are represented by Pymatgen composition
objects. Featurizers for inorganic crystal compositions that are
subclasses of this class should plan to process input which comes as
Pymatgen composition objects.
This class is abstract and cannot be invoked directly. You’ll
likely only interact with this class if you’re a developer. Child
classes need to implement the _featurize method for calculating
features for a single crystal composition.
Note
Some subclasses of this class will require pymatgen and matminer to be
installed.
If you’re creating a new featurizer that featurizes inorganic crystal structures,
you will want to inherit from the abstract MaterialStructureFeaturizer base class.
This featurizer can take pymatgen structure objects or dictionaries as inputs.
Abstract class for calculating a set of features for an
inorganic crystal structure.
The defining feature of a MaterialStructureFeaturizer is that it
operates on 3D crystal structures with periodic boundary conditions.
Inorganic crystal structures are represented by Pymatgen structure
objects. Featurizers for inorganic crystal structures that are subclasses of
this class should plan to process input which comes as pymatgen
structure objects.
This class is abstract and cannot be invoked directly. You’ll
likely only interact with this class if you’re a developer. Child
classes need to implement the _featurize method for calculating
features for a single crystal structure.
Note
Some subclasses of this class will require pymatgen and matminer to be
installed.
datapoints (Iterable[Union[Dict, pymatgen.core.Structure]]) – Iterable sequence of pymatgen structure dictionaries
or pymatgen.core.Structure objects. See
https://pymatgen.org/pymatgen.core.structure.html for the dictionary representation of pymatgen.core.Structure.
If you’re creating a new featurizer that featurizes a pair of ligand molecules and proteins,
you will want to inherit from the abstract ComplexFeaturizer base class.
This featurizer can take a pair of PDB or SDF files which contain ligand molecules and proteins.
If you’re creating a vocabulary builder for generating vocabulary from a corpus or input data,
the vocabulary builder must inherit from the VocabularyBuilder base class.
alias of the deepchem.feat.vocabulary_builders.hf_vocab module