Tokenizers

A tokenizer is in charge of preparing the inputs for a natural language processing model. For many scientific applications, it is possible to treat inputs as “words”/“sentences” and use NLP methods to make meaningful predictions. For example, SMILES strings or DNA sequences have grammatical structure and can be usefully modeled with NLP techniques. DeepChem provides some scientifically relevant tokenizers for use in different applications. These tokenizers are based on those from the HuggingFace transformers library (which DeepChem tokenizers inherit from).

The base classes PreTrainedTokenizer and PreTrainedTokenizerFast implement the common methods for encoding string inputs into model inputs and for instantiating/saving Python tokenizers, either from a local file or directory or from a pretrained tokenizer provided by the library (downloaded from HuggingFace’s AWS S3 repository).

PreTrainedTokenizer (transformers.PreTrainedTokenizer) thus implements the main methods for using all the tokenizers:

  • Tokenizing (splitting strings into sub-word token strings), converting token strings to ids and back, and encoding/decoding (i.e., tokenizing and converting to integers),
  • Adding new tokens to the vocabulary in a way that is independent of the underlying structure (BPE, SentencePiece…),
  • Managing special tokens (mask, beginning-of-sentence, etc.): adding them, assigning them to attributes in the tokenizer for easy access, and making sure they are not split during tokenization.

BatchEncoding holds the output of the tokenizer’s encoding methods (__call__, encode_plus and batch_encode_plus) and is derived from a Python dictionary. When the tokenizer is a pure Python tokenizer, this class behaves just like a standard Python dictionary and holds the various model inputs computed by these methods (input_ids, attention_mask…).
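
For instance, a minimal sketch (not part of the original docs; assumes `tokenizer` is any DeepChem/HuggingFace tokenizer instance, such as the SmilesTokenizer described below) of working with a BatchEncoding:

encoding = tokenizer("CC(=O)OC1=CC=CC=C1C(=O)O")
print(encoding["input_ids"])       # token ids to feed to a model
print(encoding["attention_mask"])  # 1 for real tokens, 0 for padding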

For more details on the base tokenizers which the DeepChem tokenizers inherit from, please refer to the following: HuggingFace tokenizers docs

Tokenization methods on string-based corpora in the life sciences are becoming increasingly popular for NLP-based applications in chemistry and biology. One such example is ChemBERTa, a transformer for molecular property prediction. DeepChem offers a tutorial on using ChemBERTa with an alternate tokenizer, a Byte-Piece Encoder, which can be found here.

SmilesTokenizer

The dc.feat.SmilesTokenizer module inherits from the BertTokenizer class in transformers. It runs a WordPiece tokenization algorithm over SMILES strings, using the SMILES tokenization regex developed by Schwaller et al.

The SmilesTokenizer employs an atom-wise tokenization strategy using the following Regex expression:

>>> SMI_REGEX_PATTERN = "(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"

To use, please install the transformers package using the following pip command:

pip install transformers


class deepchem.feat.SmilesTokenizer(vocab_file: str = '', **kwargs)[source]

Creates the SmilesTokenizer class. The tokenizer heavily inherits from the BertTokenizer implementation found in HuggingFace’s transformers library. It runs a WordPiece tokenization algorithm over SMILES strings, using the SMILES tokenization regex developed by Schwaller et al.

Please see https://github.com/huggingface/transformers and https://github.com/rxn4chemistry/rxnfp for more details.

Examples

>>> import os
>>> from deepchem.feat.smiles_tokenizer import SmilesTokenizer
>>> current_dir = os.path.dirname(os.path.realpath(__file__))
>>> vocab_path = os.path.join(current_dir, 'tests/data', 'vocab.txt')
>>> tokenizer = SmilesTokenizer(vocab_path)
>>> print(tokenizer.encode("CC(=O)OC1=CC=CC=C1C(=O)O"))
[12, 16, 16, 17, 22, 19, 18, 19, 16, 20, 22, 16, 16, 22, 16, 16, 22, 16, 20, 16, 17, 22, 19, 18, 19, 13]

References

[1]Schwaller, Philippe; Probst, Daniel; Vaucher, Alain C.; Nair, Vishnu H; Kreutter, David; Laino, Teodoro; et al. (2019): Mapping the Space of Chemical Reactions using Attention-Based Neural Networks. ChemRxiv. Preprint. https://doi.org/10.26434/chemrxiv.9897365.v3

Notes

This class requires huggingface’s transformers and tokenizers libraries to be installed.

__init__(vocab_file: str = '', **kwargs)[source]

Constructs a SmilesTokenizer.

Parameters:vocab_file (str) – Path to a vocabulary file containing one SMILES token per line. The default vocab file is found in deepchem/feat/tests/data/vocab.txt
add_padding_tokens(token_ids: List[int], length: int, right: bool = True) → List[int][source]

Adds padding tokens to return a sequence of the requested length. By default, padding tokens are added to the right of the sequence.

Parameters:
  • token_ids (list[int]) – List of tokenized input ids. Can be obtained using the encode or encode_plus methods.
  • length (int) – Length of the padded sequence to return.
  • right (bool (True by default)) – Whether padding tokens are appended on the right (True) or prepended on the left (False) of the sequence.
Returns:token_ids – List of token ids padded with the padding token up to the requested length.
Return type:list[int]
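
Example (a minimal sketch, not from the DeepChem source; assumes `tokenizer` is a SmilesTokenizer constructed from a vocabulary file as in the class-level example above):

ids = tokenizer.encode("CC(=O)O")                       # encode adds [CLS]/[SEP] by default
padded = tokenizer.add_padding_tokens(ids, length=32)   # pads on the right by default
assert len(padded) == 32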

add_special_tokens(special_tokens_dict: Dict[str, Union[str, tokenizers.AddedToken]]) → int[source]

Add a dictionary of special tokens (eos, pad, cls, etc.) to the encoder and link them to class attributes. If special tokens are NOT in the vocabulary, they are added to it (indexed starting from the last index of the current vocabulary).

Using add_special_tokens will ensure your special tokens can be used in several ways:

  • Special tokens are carefully handled by the tokenizer (they are never split).
  • You can easily refer to special tokens using tokenizer class attributes like tokenizer.cls_token. This makes it easy to develop model-agnostic training and fine-tuning scripts.

When possible, special tokens are already registered for provided pretrained models (for instance BertTokenizer’s cls_token is already registered to be '[CLS]' and XLM’s one is also registered to be '</s>').

Parameters:special_tokens_dict (dictionary str to str or tokenizers.AddedToken) –

Keys should be in the list of predefined special attributes: [bos_token, eos_token, unk_token, sep_token, pad_token, cls_token, mask_token, additional_special_tokens].

Tokens are only added if they are not already in the vocabulary (tested by checking if the tokenizer assigns the index of the unk_token to them).

Returns:Number of tokens added to the vocabulary.
Return type:int

Examples:

# Let's see how to add a new classification token to GPT-2
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

special_tokens_dict = {'cls_token': '<CLS>'}

num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
print('We have added', num_added_toks, 'tokens')
# Notice: resize_token_embeddings expects the full size of the new vocabulary, i.e., the length of the tokenizer.
model.resize_token_embeddings(len(tokenizer))

assert tokenizer.cls_token == '<CLS>'
add_special_tokens_ids_sequence_pair(token_ids_0: List[int], token_ids_1: List[int]) → List[int][source]

Adds special tokens to a sequence pair for sequence classification tasks. A BERT sequence pair has the following format: [CLS] A [SEP] B [SEP]

Parameters:
  • token_ids_0 (List[int]) – List of ids for the first string sequence in the sequence pair (A).
  • token_ids_1 (List[int]) – List of tokens for the second string sequence in the sequence pair (B).
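
Example (a sketch under the same assumptions as above; tokenize + convert_tokens_to_ids yields ids without special tokens):

ids_a = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("CCO"))
ids_b = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("c1ccccc1"))
pair_ids = tokenizer.add_special_tokens_ids_sequence_pair(ids_a, ids_b)
# pair_ids has the form [CLS] A [SEP] B [SEP]
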
add_special_tokens_ids_single_sequence(token_ids: List[int])[source]

Adds special tokens to a sequence for sequence classification tasks. A BERT sequence has the following format: [CLS] X [SEP]

Parameters:token_ids (list[int]) – list of tokenized input ids. Can be obtained using the encode or encode_plus methods.
add_special_tokens_single_sequence(tokens: List[str])[source]

Adds special tokens to a sequence for sequence classification tasks. A BERT sequence has the following format: [CLS] X [SEP]

Parameters:tokens (List[str]) – List of tokens for a given string sequence.
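
Example (a sketch; assumes the vocabulary defines [CLS] and [SEP], as the default vocab file does):

tokens = tokenizer.tokenize("CCO")
wrapped = tokenizer.add_special_tokens_single_sequence(tokens)
# e.g. ['[CLS]', 'C', 'C', 'O', '[SEP]']
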
add_tokens(new_tokens: Union[str, tokenizers.AddedToken, List[Union[str, tokenizers.AddedToken]]], special_tokens: bool = False) → int[source]

Add a list of new tokens to the tokenizer class. If the new tokens are not in the vocabulary, they are added to it with indices starting from length of the current vocabulary.

Parameters:
  • new_tokens (str, tokenizers.AddedToken or a list of str or tokenizers.AddedToken) – Tokens are only added if they are not already in the vocabulary. tokenizers.AddedToken wraps a string token to let you personalize its behavior: whether this token should only match against a single word, whether this token should strip all potential whitespaces on the left side, whether this token should strip all potential whitespaces on the right side, etc.
  • special_tokens (bool, optional, defaults to False) –

    Can be used to specify if the token is a special token. This mostly changes the normalization behavior (special tokens like [CLS] or [MASK] are usually not lower-cased, for instance).

    See details for tokenizers.AddedToken in HuggingFace tokenizers library.

Returns:

Number of tokens added to the vocabulary.

Return type:

int

Examples:

# Let's see how to increase the vocabulary of the Bert model and tokenizer
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

num_added_toks = tokenizer.add_tokens(['new_tok1', 'my_new-tok2'])
print('We have added', num_added_toks, 'tokens')
# Notice: resize_token_embeddings expects the full size of the new vocabulary, i.e., the length of the tokenizer.
model.resize_token_embeddings(len(tokenizer))
additional_special_tokens

All the additional special tokens you may want to use. Log an error if used while not having been set.

Type:List[str]
additional_special_tokens_ids

Ids of all the additional special tokens in the vocabulary. Log an error if used while not having been set.

Type:List[int]
all_special_ids

List the ids of the special tokens ('<unk>', '<cls>', etc.) mapped to class attributes.

Type:List[int]
all_special_tokens

All the special tokens ('<unk>', '<cls>', etc.) mapped to class attributes.

Convert tokens of tokenizers.AddedToken type to string.

Type:List[str]
all_special_tokens_extended

All the special tokens ('<unk>', '<cls>', etc.) mapped to class attributes.

Don’t convert tokens of tokenizers.AddedToken type to string so they can be used to control more finely how special tokens are tokenized.

Type:List[Union[str, tokenizers.AddedToken]]
batch_decode(sequences: List[List[int]], skip_special_tokens: bool = False, clean_up_tokenization_spaces: bool = True) → List[str][source]

Convert a list of lists of token ids into a list of strings by calling decode.

Parameters:
  • sequences (List[List[int]]) – List of tokenized input ids. Can be obtained using the __call__ method.
  • skip_special_tokens (bool, optional, defaults to False) – Whether or not to remove special tokens in the decoding.
  • clean_up_tokenization_spaces (bool, optional, defaults to True) – Whether or not to clean up the tokenization spaces.
Returns:

The list of decoded sentences.

Return type:

List[str]
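
Example (a sketch, assuming `tokenizer` as above):

batch_ids = [tokenizer.encode("CCO"), tokenizer.encode("c1ccccc1")]
smiles = tokenizer.batch_decode(batch_ids, skip_special_tokens=True)
# one decoded string per input sequence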

batch_encode_plus(batch_text_or_text_pairs: Union[List[str], List[Tuple[str, str]], List[List[str]], List[Tuple[List[str], List[str]]], List[List[int]], List[Tuple[List[int], List[int]]]], add_special_tokens: bool = True, padding: Union[bool, str, transformers.tokenization_utils_base.PaddingStrategy] = False, truncation: Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = False, max_length: Optional[int] = None, stride: int = 0, is_split_into_words: bool = False, pad_to_multiple_of: Optional[int] = None, return_tensors: Union[str, transformers.tokenization_utils_base.TensorType, None] = None, return_token_type_ids: Optional[bool] = None, return_attention_mask: Optional[bool] = None, return_overflowing_tokens: bool = False, return_special_tokens_mask: bool = False, return_offsets_mapping: bool = False, return_length: bool = False, verbose: bool = True, **kwargs) → transformers.tokenization_utils_base.BatchEncoding[source]

Tokenize and prepare for the model a list of sequences or a list of pairs of sequences.

Warning

This method is deprecated; __call__ should be used instead.

Parameters:
  • batch_text_or_text_pairs (List[str], List[Tuple[str, str]], List[List[str]], List[Tuple[List[str], List[str]]], and for not-fast tokenizers, also List[List[int]], List[Tuple[List[int], List[int]]]) – Batch of sequences or pair of sequences to be encoded. This can be a list of string/string-sequences/int-sequences or a list of pair of string/string-sequences/int-sequence (see details in encode_plus).
  • add_special_tokens (bool, optional, defaults to True) – Whether or not to encode the sequences with the special tokens relative to their model.
  • padding (bool, str or PaddingStrategy, optional, defaults to False) –

    Activates and controls padding. Accepts the following values:

    • True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
    • 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
    • False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
  • truncation (bool, str or TruncationStrategy, optional, defaults to False) –

    Activates and controls truncation. Accepts the following values:

    • True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
    • 'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
    • 'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
    • False or 'do_not_truncate' (default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
  • max_length (int, optional) –

    Controls the maximum length to use by one of the truncation/padding parameters.

    If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.

  • stride (int, optional, defaults to 0) – If set to a number along with max_length, the overflowing tokens returned when return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence returned to provide some overlap between truncated and overflowing sequences. The value of this argument defines the number of overlapping tokens.
  • is_split_into_words (bool, optional, defaults to False) – Whether or not the input is already pre-tokenized (e.g., split into words), in which case the tokenizer will skip the pre-tokenization step. This is useful for NER or token classification.
  • pad_to_multiple_of (int, optional) – If set will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
  • return_tensors (str or TensorType, optional) –

    If set, will return tensors instead of list of python integers. Acceptable values are:

    • 'tf': Return TensorFlow tf.constant objects.
    • 'pt': Return PyTorch torch.Tensor objects.
    • 'np': Return Numpy np.ndarray objects.
  • return_token_type_ids (bool, optional) –

    Whether to return token type IDs. If left to the default, will return the token type IDs according to the specific tokenizer’s default, defined by the return_outputs attribute.

    What are token type IDs?

  • return_attention_mask (bool, optional) –

    Whether to return the attention mask. If left to the default, will return the attention mask according to the specific tokenizer’s default, defined by the return_outputs attribute.

    What are attention masks?

  • return_overflowing_tokens (bool, optional, defaults to False) – Whether or not to return overflowing token sequences.
  • return_special_tokens_mask (bool, optional, defaults to False) – Whether or not to return special tokens mask information.
  • return_offsets_mapping (bool, optional, defaults to False) –

    Whether or not to return (char_start, char_end) for each token.

    This is only available on fast tokenizers inheriting from PreTrainedTokenizerFast, if using Python’s tokenizer, this method will raise NotImplementedError.

  • return_length (bool, optional, defaults to False) – Whether or not to return the lengths of the encoded inputs.
  • verbose (bool, optional, defaults to True) – Whether or not to print information and warnings.
  • **kwargs – passed to the self.tokenize() method
Returns:

A BatchEncoding with the following fields:

  • input_ids – List of token ids to be fed to a model.

    What are input IDs?

  • token_type_ids – List of token type ids to be fed to a model (when return_token_type_ids=True or if “token_type_ids” is in self.model_input_names).

    What are token type IDs?

  • attention_mask – List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True or if “attention_mask” is in self.model_input_names).

    What are attention masks?

  • overflowing_tokens – List of overflowing tokens sequences (when a max_length is specified and return_overflowing_tokens=True).

  • num_truncated_tokens – Number of tokens truncated (when a max_length is specified and return_overflowing_tokens=True).

  • special_tokens_mask – List of 0s and 1s, with 0 specifying added special tokens and 1 specifying regular sequence tokens (when add_special_tokens=True and return_special_tokens_mask=True).

  • length – The length of the inputs (when return_length=True)

Return type:

BatchEncoding
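
Because this method is deprecated, a sketch of the preferred __call__ interface for batches (same assumptions as above):

batch = tokenizer(
    ["CCO", "CC(=O)OC1=CC=CC=C1C(=O)O"],
    padding=True,       # pad to the longest sequence in the batch
    truncation=True,
    max_length=64,
)
print(batch["input_ids"])
print(batch["attention_mask"])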

bos_token

Beginning of sentence token. Log an error if used while not having been set.

Type:str
bos_token_id

Id of the beginning of sentence token in the vocabulary. Returns None if the token has not been set.

Type:Optional[int]
build_inputs_with_special_tokens(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) → List[int][source]

Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and adding special tokens. A BERT sequence has the following format:

  • single sequence: [CLS] X [SEP]
  • pair of sequences: [CLS] A [SEP] B [SEP]
Parameters:
  • token_ids_0 (List[int]) – List of IDs to which the special tokens will be added
  • token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.
Returns:

list of input IDs with the appropriate special tokens.

Return type:

List[int]
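
Example (a sketch; ids are obtained without special tokens, then wrapped for the model):

ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("CCO"))
model_input = tokenizer.build_inputs_with_special_tokens(ids)  # [CLS] X [SEP]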

static clean_up_tokenization(out_string: str) → str[source]

Clean up a list of simple English tokenization artifacts like spaces before punctuation marks and abbreviated forms.

Parameters:out_string (str) – The text to clean up.
Returns:The cleaned-up string.
Return type:str
cls_token

Classification token, to extract a summary of an input sequence leveraging self-attention along the full depth of the model. Log an error if used while not having been set.

Type:str
cls_token_id

Id of the classification token in the vocabulary, to extract a summary of an input sequence leveraging self-attention along the full depth of the model.

Returns None if the token has not been set.

Type:Optional[int]
convert_ids_to_tokens(ids: Union[int, List[int]], skip_special_tokens: bool = False) → Union[str, List[str]][source]

Converts a single index or a sequence of indices into a token or a sequence of tokens, using the vocabulary and added tokens.

Parameters:
  • ids (int or List[int]) – The token id (or token ids) to convert to tokens.
  • skip_special_tokens (bool, optional, defaults to False) – Whether or not to remove special tokens in the decoding.
Returns:

The decoded token(s).

Return type:

str or List[str]

convert_tokens_to_ids(tokens: Union[str, List[str]]) → Union[int, List[int]][source]

Converts a token string (or a sequence of tokens) into a single integer id (or a sequence of ids), using the vocabulary.

Parameters:tokens (str or List[str]) – One or several token(s) to convert to token id(s).
Returns:The token id or list of token ids.
Return type:int or List[int]
convert_tokens_to_string(tokens: List[str])[source]

Converts a sequence of tokens (strings) into a single string.

Parameters:tokens (List[str]) – List of tokens for a given string sequence.
Returns:out_string – Single string from combined tokens.
Return type:str
create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) → List[int][source]

Creates a mask from the two sequences passed to be used in a sequence-pair classification task. A BERT sequence pair mask has the following format:

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |

If token_ids_1 is None, only the first portion of the mask (0s) is returned.

Parameters:
  • token_ids_0 (List[int]) – List of ids.
  • token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.
Returns:

List of token type IDs according to the given sequence(s).

Return type:

List[int]
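
Example (a sketch under the same assumptions as above):

ids_a = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("CCO"))
ids_b = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("c1ccccc1"))
segment_ids = tokenizer.create_token_type_ids_from_sequences(ids_a, ids_b)
# 0s cover [CLS] A [SEP]; 1s cover B [SEP]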

decode(token_ids: List[int], skip_special_tokens: bool = False, clean_up_tokenization_spaces: bool = True) → str[source]

Converts a sequence of ids into a string, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces.

Similar to doing self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids)).

Parameters:
  • token_ids (List[int]) – List of tokenized input ids. Can be obtained using the __call__ method.
  • skip_special_tokens (bool, optional, defaults to False) – Whether or not to remove special tokens in the decoding.
  • clean_up_tokenization_spaces (bool, optional, defaults to True) – Whether or not to clean up the tokenization spaces.
Returns:

The decoded sentence.

Return type:

str
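
Example (a sketch; note that the underlying BERT-style decoder joins tokens with spaces):

ids = tokenizer.encode("CC(=O)OC1=CC=CC=C1C(=O)O")
print(tokenizer.decode(ids, skip_special_tokens=True))
# the tokens of the original SMILES, space-separated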

encode(text: Union[str, List[str], List[int]], text_pair: Union[str, List[str], List[int], None] = None, add_special_tokens: bool = True, padding: Union[bool, str, transformers.tokenization_utils_base.PaddingStrategy] = False, truncation: Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = False, max_length: Optional[int] = None, stride: int = 0, return_tensors: Union[str, transformers.tokenization_utils_base.TensorType, None] = None, **kwargs) → List[int][source]

Converts a string to a sequence of ids (integer), using the tokenizer and vocabulary.

Same as doing self.convert_tokens_to_ids(self.tokenize(text)).

Parameters:
  • text (str, List[str] or List[int]) – The first sequence to be encoded. This can be a string, a list of strings (tokenized string using the tokenize method) or a list of integers (tokenized string ids using the convert_tokens_to_ids method).
  • text_pair (str, List[str] or List[int], optional) – Optional second sequence to be encoded. This can be a string, a list of strings (tokenized string using the tokenize method) or a list of integers (tokenized string ids using the convert_tokens_to_ids method).
  • add_special_tokens (bool, optional, defaults to True) – Whether or not to encode the sequences with the special tokens relative to their model.
  • padding (bool, str or PaddingStrategy, optional, defaults to False) –

    Activates and controls padding. Accepts the following values:

    • True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
    • 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
    • False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
  • truncation (bool, str or TruncationStrategy, optional, defaults to False) –

    Activates and controls truncation. Accepts the following values:

    • True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
    • 'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
    • 'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
    • False or 'do_not_truncate' (default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
  • max_length (int, optional) –

    Controls the maximum length to use by one of the truncation/padding parameters.

    If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.

  • stride (int, optional, defaults to 0) – If set to a number along with max_length, the overflowing tokens returned when return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence returned to provide some overlap between truncated and overflowing sequences. The value of this argument defines the number of overlapping tokens.
  • is_split_into_words (bool, optional, defaults to False) – Whether or not the input is already pre-tokenized (e.g., split into words), in which case the tokenizer will skip the pre-tokenization step. This is useful for NER or token classification.
  • pad_to_multiple_of (int, optional) – If set will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
  • return_tensors (str or TensorType, optional) –

    If set, will return tensors instead of list of python integers. Acceptable values are:

    • 'tf': Return TensorFlow tf.constant objects.
    • 'pt': Return PyTorch torch.Tensor objects.
    • 'np': Return Numpy np.ndarray objects.
  • **kwargs – Passed along to the .tokenize() method.
Returns:

The tokenized ids of the text.

Return type:

List[int], torch.Tensor, tf.Tensor or np.ndarray
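
Example (a sketch showing the truncation/padding arguments; same assumptions as above):

ids = tokenizer.encode(
    "CC(=O)OC1=CC=CC=C1C(=O)O",
    max_length=16,
    truncation=True,
    padding='max_length',
)
assert len(ids) == 16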

encode_plus(text: Union[str, List[str], List[int]], text_pair: Union[str, List[str], List[int], None] = None, add_special_tokens: bool = True, padding: Union[bool, str, transformers.tokenization_utils_base.PaddingStrategy] = False, truncation: Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = False, max_length: Optional[int] = None, stride: int = 0, is_split_into_words: bool = False, pad_to_multiple_of: Optional[int] = None, return_tensors: Union[str, transformers.tokenization_utils_base.TensorType, None] = None, return_token_type_ids: Optional[bool] = None, return_attention_mask: Optional[bool] = None, return_overflowing_tokens: bool = False, return_special_tokens_mask: bool = False, return_offsets_mapping: bool = False, return_length: bool = False, verbose: bool = True, **kwargs) → transformers.tokenization_utils_base.BatchEncoding[source]

Tokenize and prepare for the model a sequence or a pair of sequences.

Warning

This method is deprecated; __call__ should be used instead.

Parameters:
  • text (str, List[str] or List[int] (the latter only for not-fast tokenizers)) – The first sequence to be encoded. This can be a string, a list of strings (tokenized string using the tokenize method) or a list of integers (tokenized string ids using the convert_tokens_to_ids method).
  • text_pair (str, List[str] or List[int], optional) – Optional second sequence to be encoded. This can be a string, a list of strings (tokenized string using the tokenize method) or a list of integers (tokenized string ids using the convert_tokens_to_ids method).
  • add_special_tokens (bool, optional, defaults to True) – Whether or not to encode the sequences with the special tokens relative to their model.
  • padding (bool, str or PaddingStrategy, optional, defaults to False) –

    Activates and controls padding. Accepts the following values:

    • True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
    • 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
    • False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
  • truncation (bool, str or TruncationStrategy, optional, defaults to False) –

    Activates and controls truncation. Accepts the following values:

    • True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
    • 'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
    • 'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
    • False or 'do_not_truncate' (default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
  • max_length (int, optional) –

    Controls the maximum length to use by one of the truncation/padding parameters.

    If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.

  • stride (int, optional, defaults to 0) – If set to a number along with max_length, the overflowing tokens returned when return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence returned to provide some overlap between truncated and overflowing sequences. The value of this argument defines the number of overlapping tokens.
  • is_split_into_words (bool, optional, defaults to False) – Whether or not the input is already pre-tokenized (e.g., split into words), in which case the tokenizer will skip the pre-tokenization step. This is useful for NER or token classification.
  • pad_to_multiple_of (int, optional) – If set will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
  • return_tensors (str or TensorType, optional) –

    If set, will return tensors instead of list of python integers. Acceptable values are:

    • 'tf': Return TensorFlow tf.constant objects.
    • 'pt': Return PyTorch torch.Tensor objects.
    • 'np': Return Numpy np.ndarray objects.
  • return_token_type_ids (bool, optional) –

    Whether to return token type IDs. If left to the default, will return the token type IDs according to the specific tokenizer’s default, defined by the return_outputs attribute.

    What are token type IDs?

  • return_attention_mask (bool, optional) –

    Whether to return the attention mask. If left to the default, will return the attention mask according to the specific tokenizer’s default, defined by the return_outputs attribute.

    What are attention masks?

  • return_overflowing_tokens (bool, optional, defaults to False) – Whether or not to return overflowing token sequences.
  • return_special_tokens_mask (bool, optional, defaults to False) – Whether or not to return special tokens mask information.
  • return_offsets_mapping (bool, optional, defaults to False) –

    Whether or not to return (char_start, char_end) for each token.

    This is only available on fast tokenizers inheriting from PreTrainedTokenizerFast, if using Python’s tokenizer, this method will raise NotImplementedError.

  • return_length (bool, optional, defaults to False) – Whether or not to return the lengths of the encoded inputs.
  • verbose (bool, optional, defaults to True) – Whether or not to print information and warnings.
  • **kwargs – passed to the self.tokenize() method
Returns:

A BatchEncoding with the following fields:

  • input_ids – List of token ids to be fed to a model.

    What are input IDs?

  • token_type_ids – List of token type ids to be fed to a model (when return_token_type_ids=True or if “token_type_ids” is in self.model_input_names).

    What are token type IDs?

  • attention_mask – List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True or if “attention_mask” is in self.model_input_names).

    What are attention masks?

  • overflowing_tokens – List of overflowing tokens sequences (when a max_length is specified and return_overflowing_tokens=True).

  • num_truncated_tokens – Number of tokens truncated (when a max_length is specified and return_overflowing_tokens=True).

  • special_tokens_mask – List of 0s and 1s, with 0 specifying added special tokens and 1 specifying regular sequence tokens (when add_special_tokens=True and return_special_tokens_mask=True).

  • length – The length of the inputs (when return_length=True)

Return type:

BatchEncoding

eos_token

End of sentence token. Log an error if used while not having been set.

Type:str
eos_token_id

Id of the end of sentence token in the vocabulary. Returns None if the token has not been set.

Type:Optional[int]
classmethod from_pretrained(*inputs, **kwargs)[source]

Instantiate a PreTrainedTokenizerBase (or a derived class) from a predefined tokenizer.

Parameters:
  • pretrained_model_name_or_path (str) –

    Can be either:

    • A string with the shortcut name of a predefined tokenizer to load from cache or download, e.g., bert-base-uncased.
    • A string with the identifier name of a predefined tokenizer that was user-uploaded to our S3, e.g., dbmdz/bert-base-german-cased.
    • A path to a directory containing vocabulary files required by the tokenizer, for instance saved using the save_pretrained() method, e.g., ./my_model_directory/.
    • (Deprecated, not applicable to all derived classes) A path or url to a single saved vocabulary file (if and only if the tokenizer only requires a single vocabulary file like Bert or XLNet), e.g., ./my_model_directory/vocab.txt.
  • cache_dir (str, optional) – Path to a directory in which a downloaded predefined tokenizer vocabulary files should be cached if the standard cache should not be used.
  • force_download (bool, optional, defaults to False) – Whether or not to force the (re-)download of the vocabulary files and override the cached versions if they exist.
  • resume_download (bool, optional, defaults to False) – Whether or not to delete incompletely received files. Attempt to resume the download if such a file exists.
  • proxies (Dict[str, str], optional) – A dictionary of proxy servers to use by protocol or endpoint, e.g., {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}. The proxies are used on each request.
  • inputs (additional positional arguments, optional) – Will be passed along to the Tokenizer __init__ method.
  • kwargs (additional keyword arguments, optional) – Will be passed to the Tokenizer __init__ method. Can be used to set special tokens like bos_token, eos_token, unk_token, sep_token, pad_token, cls_token, mask_token, additional_special_tokens. See parameters in the __init__ for more details.

Examples:

# We can't directly instantiate the base class `PreTrainedTokenizerBase`,
# so the examples below use a derived class: BertTokenizer.
from transformers import BertTokenizer

# Download vocabulary from S3 and cache.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Download vocabulary from S3 (user-uploaded) and cache.
tokenizer = BertTokenizer.from_pretrained('dbmdz/bert-base-german-cased')

# If vocabulary files are in a directory (e.g. tokenizer was saved using `save_pretrained('./test/saved_model/')`)
tokenizer = BertTokenizer.from_pretrained('./test/saved_model/')

# If the tokenizer uses a single vocabulary file, you can point directly to this file
tokenizer = BertTokenizer.from_pretrained('./test/saved_model/my_vocab.txt')

# You can link tokens to special vocabulary when instantiating
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', unk_token='<unk>')
# You should be sure '<unk>' is in the vocabulary when doing that.
# Otherwise, use tokenizer.add_special_tokens({'unk_token': '<unk>'}) instead.
assert tokenizer.unk_token == '<unk>'
get_added_vocab() → Dict[str, int][source]

Returns the added tokens in the vocabulary as a dictionary of token to index.

Returns:The added tokens.
Return type:Dict[str, int]
get_special_tokens_mask(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False) → List[int][source]

Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model method.

Parameters:
  • token_ids_0 (List[int]) – List of ids.
  • token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.
  • already_has_special_tokens (bool, optional, defaults to False) – Set to True if the token list is already formatted with special tokens for the model
Returns:

A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

Return type:

List[int]
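
Example (a sketch; encode adds special tokens by default, so already_has_special_tokens is set):

ids = tokenizer.encode("CCO")
mask = tokenizer.get_special_tokens_mask(ids, already_has_special_tokens=True)
# 1 marks a special token, 0 marks a sequence token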

get_vocab()[source]

Returns the vocabulary as a dictionary of token to index.

tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.

Returns:The vocabulary.
Return type:Dict[str, int]
mask_token

Mask token, to use when training a model with masked-language modeling. Log an error if used while not having been set.

Type:str
mask_token_id

Id of the mask token in the vocabulary, used when training a model with masked-language modeling. Returns None if the token has not been set.

Type:Optional[int]
max_len

Deprecated. Kept here for backward compatibility; now renamed to model_max_length to avoid ambiguity.

Type:int
max_len_sentences_pair

The maximum combined length of a pair of sentences that can be fed to the model.

Type:int
max_len_single_sentence

The maximum length of a sentence that can be fed to the model.

Type:int
num_special_tokens_to_add(pair: bool = False) → int[source]

Returns the number of added tokens when encoding a sequence with special tokens.

Note

This encodes a dummy input and checks the number of added tokens, and is therefore not efficient. Do not put this inside your training loop.

Parameters:pair (bool, optional, defaults to False) – Whether the number of added tokens should be computed in the case of a sequence pair or a single sequence.
Returns:Number of special tokens added to sequences.
Return type:int
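
Example (a sketch of using this to compute a token budget; max_length here is a hypothetical model limit):

max_length = 128
budget = max_length - tokenizer.num_special_tokens_to_add(pair=False)
# number of sequence tokens that fit alongside the special tokens
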
pad(encoded_inputs: Union[transformers.tokenization_utils_base.BatchEncoding, List[transformers.tokenization_utils_base.BatchEncoding], Dict[str, List[int]], Dict[str, List[List[int]]], List[Dict[str, List[int]]]], padding: Union[bool, str, transformers.tokenization_utils_base.PaddingStrategy] = True, max_length: Optional[int] = None, pad_to_multiple_of: Optional[int] = None, return_attention_mask: Optional[bool] = None, return_tensors: Union[str, transformers.tokenization_utils_base.TensorType, None] = None, verbose: bool = True) → transformers.tokenization_utils_base.BatchEncoding[source]

Pad a single encoded input or a batch of encoded inputs up to a predefined length or to the maximum sequence length in the batch.

The padding side (left/right) and padding token ids are defined at the tokenizer level (with self.padding_side, self.pad_token_id and self.pad_token_type_id).

Note

If the encoded_inputs passed are dictionaries of numpy arrays, PyTorch tensors or TensorFlow tensors, the result will use the same type unless you provide a different tensor type with return_tensors. Note, however, that in the case of PyTorch tensors, you will lose the specific device of your tensors.

Parameters:
  • encoded_inputs (BatchEncoding, list of BatchEncoding, Dict[str, List[int]], Dict[str, List[List[int]] or List[Dict[str, List[int]]]) –

    Tokenized inputs. Can represent one input (BatchEncoding or Dict[str, List[int]]) or a batch of tokenized inputs (list of BatchEncoding, Dict[str, List[List[int]]] or List[Dict[str, List[int]]]) so you can use this method during preprocessing as well as in a PyTorch Dataloader collate function.

    Instead of List[int] you can have tensors (numpy arrays, PyTorch tensors or TensorFlow tensors), see the note above for the return type.

  • padding (bool, str or PaddingStrategy, optional, defaults to True) –

    Select a strategy to pad the returned sequences (according to the model’s padding side and padding index) among:
    • True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
    • 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
    • False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
  • max_length (int, optional) – Maximum length of the returned list and optionally padding length (see above).
  • pad_to_multiple_of (int, optional) –

    If set will pad the sequence to a multiple of the provided value.

    This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).

  • return_attention_mask (bool, optional) –

    Whether to return the attention mask. If left to the default, will return the attention mask according to the specific tokenizer’s default, defined by the return_outputs attribute.

    What are attention masks?

  • return_tensors (str or TensorType, optional) –

    If set, will return tensors instead of list of python integers. Acceptable values are:

    • 'tf': Return TensorFlow tf.constant objects.
    • 'pt': Return PyTorch torch.Tensor objects.
    • 'np': Return Numpy np.ndarray objects.
  • verbose (bool, optional, defaults to True) – Whether or not to print information and warnings.
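
Example (a sketch; requires numpy for return_tensors='np'):

encodings = tokenizer(["CCO", "CC(=O)OC1=CC=CC=C1C(=O)O"])
batch = tokenizer.pad(encodings, padding='longest', return_tensors='np')
print(batch["input_ids"].shape)  # (2, length of the longest sequence)
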
pad_token

Padding token. Log an error if used while not having been set.

Type:str
pad_token_id

Id of the padding token in the vocabulary. Returns None if the token has not been set.

Type:Optional[int]
pad_token_type_id

Id of the padding token type in the vocabulary.

Type:int
prepare_for_model(ids: List[int], pair_ids: Optional[List[int]] = None, add_special_tokens: bool = True, padding: Union[bool, str, transformers.tokenization_utils_base.PaddingStrategy] = False, truncation: Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = False, max_length: Optional[int] = None, stride: int = 0, pad_to_multiple_of: Optional[int] = None, return_tensors: Union[str, transformers.tokenization_utils_base.TensorType, None] = None, return_token_type_ids: Optional[bool] = None, return_attention_mask: Optional[bool] = None, return_overflowing_tokens: bool = False, return_special_tokens_mask: bool = False, return_offsets_mapping: bool = False, return_length: bool = False, verbose: bool = True, prepend_batch_axis: bool = False, **kwargs) → transformers.tokenization_utils_base.BatchEncoding[source]

Prepares a sequence of input ids, or a pair of sequences of input ids, so that it can be used by the model. It adds special tokens, truncates sequences if overflowing while taking into account the special tokens, and manages a moving window (with user-defined stride) for overflowing tokens.

Parameters:
  • ids (List[int]) – Tokenized input ids of the first sequence. Can be obtained from a string by chaining the tokenize and convert_tokens_to_ids methods.
  • pair_ids (List[int], optional) – Tokenized input ids of the second sequence. Can be obtained from a string by chaining the tokenize and convert_tokens_to_ids methods.
  • add_special_tokens (bool, optional, defaults to True) – Whether or not to encode the sequences with the special tokens relative to their model.
  • padding (bool, str or PaddingStrategy, optional, defaults to False) –

    Activates and controls padding. Accepts the following values:

    • True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
    • 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
    • False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
  • truncation (bool, str or TruncationStrategy, optional, defaults to False) –

    Activates and controls truncation. Accepts the following values:

    • True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
    • 'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
    • 'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
    • False or 'do_not_truncate' (default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
  • max_length (int, optional) –

    Controls the maximum length to use by one of the truncation/padding parameters.

    If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.

  • stride (int, optional, defaults to 0) – If set to a number along with max_length, the overflowing tokens returned when return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence returned to provide some overlap between truncated and overflowing sequences. The value of this argument defines the number of overlapping tokens.
  • is_split_into_words (bool, optional, defaults to False) – Whether or not the input is already pre-tokenized (e.g., split into words), in which case the tokenizer will skip the pre-tokenization step. This is useful for NER or token classification.
  • pad_to_multiple_of (int, optional) – If set will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
  • return_tensors (str or TensorType, optional) –

    If set, will return tensors instead of list of python integers. Acceptable values are:

    • 'tf': Return TensorFlow tf.constant objects.
    • 'pt': Return PyTorch torch.Tensor objects.
    • 'np': Return Numpy np.ndarray objects.
  • return_token_type_ids (bool, optional) –

    Whether to return token type IDs. If left to the default, will return the token type IDs according to the specific tokenizer’s default, defined by the return_outputs attribute.

    What are token type IDs?

  • return_attention_mask (bool, optional) –

    Whether to return the attention mask. If left to the default, will return the attention mask according to the specific tokenizer’s default, defined by the return_outputs attribute.

    What are attention masks?

  • return_overflowing_tokens (bool, optional, defaults to False) – Whether or not to return overflowing token sequences.
  • return_special_tokens_mask (bool, optional, defaults to False) – Whether or not to return special tokens mask information.
  • return_offsets_mapping (bool, optional, defaults to False) –

    Whether or not to return (char_start, char_end) for each token.

    This is only available on fast tokenizers inheriting from PreTrainedTokenizerFast, if using Python’s tokenizer, this method will raise NotImplementedError.

  • return_length (bool, optional, defaults to False) – Whether or not to return the lengths of the encoded inputs.
  • verbose (bool, optional, defaults to True) – Whether or not to print information and warnings.
  • **kwargs – passed to the self.tokenize() method
Returns:

A BatchEncoding with the following fields:

  • input_ids – List of token ids to be fed to a model.

    What are input IDs?

  • token_type_ids – List of token type ids to be fed to a model (when return_token_type_ids=True or if “token_type_ids” is in self.model_input_names).

    What are token type IDs?

  • attention_mask – List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True or if “attention_mask” is in self.model_input_names).

    What are attention masks?

  • overflowing_tokens – List of overflowing tokens sequences (when a max_length is specified and return_overflowing_tokens=True).

  • num_truncated_tokens – Number of tokens truncated (when a max_length is specified and return_overflowing_tokens=True).

  • special_tokens_mask – List of 0s and 1s, with 0 specifying added special tokens and 1 specifying regular sequence tokens (when add_special_tokens=True and return_special_tokens_mask=True).

  • length – The length of the inputs (when return_length=True)

Return type:

BatchEncoding

prepare_for_tokenization(text: str, is_split_into_words: bool = False, **kwargs) → Tuple[str, Dict[str, Any]][source]

Performs any necessary transformations before tokenization.

This method should pop the arguments from kwargs and return the remaining kwargs as well. We test the kwargs at the end of the encoding process to be sure all the arguments have been used.

Parameters:
  • text (str) – The text to prepare.
  • is_split_into_words (bool, optional, defaults to False) – Whether or not the text has been pretokenized.
  • kwargs – Keyword arguments to use for the tokenization.
Returns:

The prepared text and the unused kwargs.

Return type:

Tuple[str, Dict[str, Any]]

prepare_seq2seq_batch(src_texts: List[str], tgt_texts: Optional[List[str]] = None, max_length: Optional[int] = None, max_target_length: Optional[int] = None, padding: str = 'longest', return_tensors: str = None, truncation=True, **kwargs) → transformers.tokenization_utils_base.BatchEncoding[source]

Prepare a batch that can be passed directly to an instance of AutoModelForSeq2SeqLM.

Parameters:
  • src_texts (List[str]) – List of documents to summarize or source language texts.
  • tgt_texts (List[str], optional) – List of summaries or target language texts.
  • max_length (int, optional) – Controls the maximum length for encoder inputs (documents to summarize or source language texts). If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.
  • max_target_length (int, optional) – Controls the maximum length of decoder inputs (target language texts or summaries). If left unset or set to None, this will use the max_length value.
  • padding (bool, str or PaddingStrategy, optional, defaults to 'longest') –

    Activates and controls padding. Accepts the following values:

    • True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
    • 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
    • False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
  • return_tensors (str or TensorType, optional) –

    If set, will return tensors instead of list of python integers. Acceptable values are:

    • 'tf': Return TensorFlow tf.constant objects.
    • 'pt': Return PyTorch torch.Tensor objects.
    • 'np': Return Numpy np.ndarray objects.
  • truncation (bool, str or TruncationStrategy, optional, defaults to True) –

    Activates and controls truncation. Accepts the following values:

    • True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
    • 'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
    • 'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
    • False or 'do_not_truncate' (default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
  • **kwargs – Additional keyword arguments passed along to self.__call__.
Returns:

A BatchEncoding with the following fields:

  • input_ids – List of token ids to be fed to the encoder.
  • attention_mask – List of indices specifying which tokens should be attended to by the model.
  • labels – List of token ids for tgt_texts

The full set of keys [input_ids, attention_mask, labels] will only be returned if tgt_texts is passed. Otherwise, input_ids and attention_mask will be the only keys.

Return type:

BatchEncoding

sanitize_special_tokens() → int[source]

Make sure that all the special tokens attributes of the tokenizer (tokenizer.mask_token, tokenizer.cls_token, etc.) are in the vocabulary.

Add the missing ones to the vocabulary if needed.

Returns:The number of tokens added to the vocabulary during the operation.
Return type:int
save_pretrained(save_directory: str) → Tuple[str][source]

Save the tokenizer vocabulary files together with:

  • added tokens,
  • special tokens to class attributes mapping,
  • tokenizer instantiation positional and keywords inputs (e.g. do_lower_case for Bert).

This method makes sure the full tokenizer can then be re-loaded using the from_pretrained() class method.

Warning

This won’t save modifications you may have applied to the tokenizer after the instantiation (for instance, modifying tokenizer.do_lower_case after creation).

Parameters:save_directory (str) – The path to a directory where the tokenizer will be saved.
Returns:The files saved.
Return type:A tuple of str
save_vocabulary(vocab_path: str)[source]

Save the tokenizer vocabulary to a file.

Parameters:vocab_path (str) – The directory in which to save the one-SMILES-token-per-line vocabulary file. The default vocab file is found in deepchem/feat/tests/data/vocab.txt
Returns:vocab_file – Tuple containing the path(s) to the saved vocabulary file(s).
Return type:Tuple(str)
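
Example (a sketch of the save/reload round trip; the target directory is hypothetical and must exist before saving):

import os
from deepchem.feat.smiles_tokenizer import SmilesTokenizer

os.makedirs('./smiles_tokenizer', exist_ok=True)
tokenizer.save_pretrained('./smiles_tokenizer')
reloaded = SmilesTokenizer.from_pretrained('./smiles_tokenizer')
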
sep_token

Separation token, to separate context and query in an input sequence. Log an error if used while not having been set.

Type:str
sep_token_id

Id of the separation token in the vocabulary, to separate context and query in an input sequence. Returns None if the token has not been set.

Type:Optional[int]
special_tokens_map

A dictionary mapping special token class attributes (cls_token, unk_token, etc.) to their values ('<unk>', '<cls>', etc.).

Convert potential tokens of tokenizers.AddedToken type to string.

Type:Dict[str, Union[str, List[str]]]
special_tokens_map_extended

A dictionary mapping special token class attributes (cls_token, unk_token, etc.) to their values ('<unk>', '<cls>', etc.).

Don’t convert tokens of tokenizers.AddedToken type to string so they can be used to control more finely how special tokens are tokenized.

Type:Dict[str, Union[str, tokenizers.AddedToken, List[Union[str, tokenizers.AddedToken]]]]
tokenize(text: str, **kwargs) → List[str][source]

Converts a string into a sequence of tokens, using the tokenizer.

Splits into words for word-based vocabularies, or into sub-words for sub-word-based vocabularies (BPE/SentencePiece/WordPiece). Takes care of added tokens.

Parameters:
  • text (str) – The sequence to be encoded.
  • **kwargs (additional keyword arguments) – Passed along to the model-specific prepare_for_tokenization preprocessing method.
Returns:

The list of tokens.

Return type:

List[str]
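
Example (a sketch, assuming `tokenizer` as above):

tokens = tokenizer.tokenize("CC(=O)OC1=CC=CC=C1C(=O)O")
# e.g. ['C', 'C', '(', '=', 'O', ')', 'O', 'C', '1', ...]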

truncate_sequences(ids: List[int], pair_ids: Optional[List[int]] = None, num_tokens_to_remove: int = 0, truncation_strategy: Union[str, transformers.tokenization_utils_base.TruncationStrategy] = 'longest_first', stride: int = 0) → Tuple[List[int], List[int], List[int]][source]

Truncates a sequence pair in-place following the strategy.

Parameters:
  • ids (List[int]) – Tokenized input ids of the first sequence. Can be obtained from a string by chaining the tokenize and convert_tokens_to_ids methods.
  • pair_ids (List[int], optional) – Tokenized input ids of the second sequence. Can be obtained from a string by chaining the tokenize and convert_tokens_to_ids methods.
  • num_tokens_to_remove (int, optional, defaults to 0) – Number of tokens to remove using the truncation strategy.
  • truncation_strategy (str or TruncationStrategy, optional, defaults to 'longest_first') –

    The strategy to follow for truncation. Can be:

    • 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
    • 'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
    • 'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
    • 'do_not_truncate' (default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
  • max_length (int, optional) –

    Controls the maximum length to use by one of the truncation/padding parameters.

    If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.

  • stride (int, optional, defaults to 0) – If set to a positive number, the overflowing tokens returned will contain some tokens from the main sequence returned. The value of this argument defines the number of additional tokens.
Returns:

The truncated ids, the truncated pair_ids and the list of overflowing tokens.

Return type:

Tuple[List[int], List[int], List[int]]
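
Example (a sketch; removes four tokens from a single over-long sequence):

ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("CC(=O)OC1=CC=CC=C1C(=O)O"))
truncated, _, overflowing = tokenizer.truncate_sequences(ids, num_tokens_to_remove=4)
assert len(truncated) == len(ids) - 4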

unk_token

Unknown token. Log an error if used while not having been set.

Type:str
unk_token_id

Id of the unknown token in the vocabulary. Returns None if the token has not been set.

Type:Optional[int]
vocab_size

Size of the base vocabulary (without the added tokens).

Type:int

BasicSmilesTokenizer

The dc.feat.BasicSmilesTokenizer module uses a regex tokenization pattern, developed by Schwaller et al., to tokenize SMILES strings. This tokenizer is intended for use on SMILES strings when the user does not wish to rely on the transformers API.

References: - Molecular Transformer: Unsupervised Attention-Guided Atom-Mapping

class deepchem.feat.BasicSmilesTokenizer(regex_pattern: str = '(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])')[source]

Run basic SMILES tokenization using a regex pattern developed by Schwaller et al. This tokenizer is to be used when a tokenizer that does not require HuggingFace’s transformers library is needed.

Examples

>>> from deepchem.feat.smiles_tokenizer import BasicSmilesTokenizer
>>> tokenizer = BasicSmilesTokenizer()
>>> print(tokenizer.tokenize("CC(=O)OC1=CC=CC=C1C(=O)O"))
['C', 'C', '(', '=', 'O', ')', 'O', 'C', '1', '=', 'C', 'C', '=', 'C', 'C', '=', 'C', '1', 'C', '(', '=', 'O', ')', 'O']

References

[1]Schwaller, Philippe; Laino, Teodoro; Gaudin, Théophile; Bolgar, Peter; Hunter, Christopher A.; Bekas, Costas; Lee, Alpha A. (2019): Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction. ACS Central Science 5 (9): 1572-1583. DOI: 10.1021/acscentsci.9b00576
__init__(regex_pattern: str = '(\\[[^\\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\\(|\\)|\\.|=|#|-|\\+|\\\\|\\/|:|~|@|\\?|>>?|\\*|\\$|\\%[0-9]{2}|[0-9])')[source]

Constructs a BasicSmilesTokenizer.

Parameters:regex_pattern (str) – SMILES token regex pattern.

tokenize(text)[source]

Performs basic tokenization of a SMILES string.

Parameters:text (str) – The SMILES string to tokenize.