Tokenizers¶
A tokenizer is in charge of preparing the inputs for a natural language processing model. For many scientific applications, it is possible to treat inputs as “words”/”sentences” and use NLP methods to make meaningful predictions. For example, SMILES strings or DNA sequences have grammatical structure and can be usefully modeled with NLP techniques. DeepChem provides some scientifically relevant tokenizers for use in different applications. These tokenizers are based on those from the Huggingface transformers library (which DeepChem tokenizers inherit from).
The base classes PreTrainedTokenizer and PreTrainedTokenizerFast implement the common methods for encoding string inputs into model inputs and for instantiating/saving Python tokenizers, either from a local file or directory or from a pretrained tokenizer provided by the library (downloaded from HuggingFace’s AWS S3 repository).
PreTrainedTokenizer (transformers.PreTrainedTokenizer) thus implements the main methods for using all the tokenizers:
Tokenizing (splitting strings into sub-word token strings), converting token strings to ids and back, and encoding/decoding (i.e. tokenizing + converting to integers),
Adding new tokens to the vocabulary in a way that is independent of the underlying structure (BPE, SentencePiece…),
Managing special tokens such as mask, beginning-of-sentence, etc. (adding them, assigning them to attributes in the tokenizer for easy access, and making sure they are not split during tokenization)
BatchEncoding holds the output of the tokenizer’s encoding methods (__call__, encode_plus and batch_encode_plus) and is derived from a Python dictionary. When the tokenizer is a pure Python tokenizer, this class behaves just like a standard Python dictionary and holds the various model inputs computed by these methods (input_ids, attention_mask…).
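Because the encoding output is dictionary-like, model inputs can be accessed by key. The following sketch shows the typical shape of such an output without requiring transformers to be installed; the ids below are illustrative placeholders, not values from a real vocabulary:

```python
# Illustrative sketch of a BatchEncoding-style output; the ids are
# made up and do not come from a real vocabulary file.
encoding = {
    "input_ids": [12, 16, 16, 17, 22, 13],  # token ids, including special tokens
    "attention_mask": [1, 1, 1, 1, 1, 1],   # 1 = real token, 0 = padding
}

# Dictionary-style access, just as with a real BatchEncoding
print(encoding["input_ids"])
print(encoding["attention_mask"])
```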
For more details on the base tokenizers which the DeepChem tokenizers inherit from, please refer to the following: HuggingFace tokenizers docs
Tokenization methods on string-based corpuses in the life sciences are becoming increasingly popular for NLP-based applications to chemistry and biology. One such example is ChemBERTa, a transformer for molecular property prediction. DeepChem offers a tutorial for utilizing ChemBERTa using an alternate tokenizer, a Byte-Piece Encoder, which can be found here.
SmilesTokenizer¶
The dc.feat.SmilesTokenizer
module inherits from the BertTokenizer class in transformers. It runs a WordPiece tokenization algorithm over SMILES strings, using the tokenization SMILES regex developed by Schwaller et al.
The SmilesTokenizer employs an atom-wise tokenization strategy using the following Regex expression:
SMI_REGEX_PATTERN = "(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"
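As a sanity check, the pattern can be applied directly with Python’s built-in re module; this sketch reproduces the atom-wise tokenization without installing DeepChem or transformers:

```python
import re

# Atom-wise SMILES regex from Schwaller et al., written as a raw string
# so the backslashes survive Python's string parsing.
SMI_REGEX_PATTERN = (
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"
)

# Each match is one token: a bracket atom, a two-letter halogen,
# a single organic-subset atom, a bond/branch symbol, or a ring index.
tokens = re.findall(SMI_REGEX_PATTERN, "CC(=O)OC1=CC=CC=C1C(=O)O")
print(tokens)
```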
To use, please install the transformers package using the following pip command:
pip install transformers
-
class SmilesTokenizer(vocab_file: str = '', **kwargs)[source]¶ Creates the SmilesTokenizer class. The tokenizer heavily inherits from the BertTokenizer implementation found in HuggingFace’s transformers library. It runs a WordPiece tokenization algorithm over SMILES strings, using the tokenization SMILES regex developed by Schwaller et al.
Please see https://github.com/huggingface/transformers and https://github.com/rxn4chemistry/rxnfp for more details.
Examples
>>> import os
>>> from deepchem.feat.smiles_tokenizer import SmilesTokenizer
>>> current_dir = os.path.dirname(os.path.realpath(__file__))
>>> vocab_path = os.path.join(current_dir, 'tests/data', 'vocab.txt')
>>> tokenizer = SmilesTokenizer(vocab_path)
>>> print(tokenizer.encode("CC(=O)OC1=CC=CC=C1C(=O)O"))
[12, 16, 16, 17, 22, 19, 18, 19, 16, 20, 22, 16, 16, 22, 16, 16, 22, 16, 20, 16, 17, 22, 19, 18, 19, 13]
References
- 1
Schwaller, Philippe; Probst, Daniel; Vaucher, Alain C.; Nair, Vishnu H; Kreutter, David; Laino, Teodoro; et al. (2019): Mapping the Space of Chemical Reactions using Attention-Based Neural Networks. ChemRxiv. Preprint. https://doi.org/10.26434/chemrxiv.9897365.v3
Notes
This class requires huggingface’s transformers and tokenizers libraries to be installed.
-
__init__(vocab_file: str = '', **kwargs)[source]¶ Constructs a SmilesTokenizer.
- Parameters
vocab_file (str) – Path to a vocabulary file with one SMILES character/token per line. The default vocab file is found in deepchem/feat/tests/data/vocab.txt
-
convert_tokens_to_string(tokens: List[str])[source]¶ Converts a sequence of tokens (string) into a single string.
- Parameters
tokens (List[str]) – List of tokens for a given string sequence.
- Returns
out_string – Single string from combined tokens.
- Return type
str
-
add_special_tokens_ids_single_sequence(token_ids: List[int])[source]¶ Adds special tokens to a sequence for sequence classification tasks. A BERT sequence has the following format: [CLS] X [SEP]
- Parameters
token_ids (list[int]) – list of tokenized input ids. Can be obtained using the encode or encode_plus methods.
-
add_special_tokens_single_sequence(tokens: List[str])[source]¶ Adds special tokens to a sequence for sequence classification tasks. A BERT sequence has the following format: [CLS] X [SEP]
- Parameters
tokens (List[str]) – List of tokens for a given string sequence.
-
add_special_tokens_ids_sequence_pair(token_ids_0: List[int], token_ids_1: List[int]) → List[int][source]¶ Adds special tokens to a sequence pair for sequence classification tasks. A BERT sequence pair has the following format: [CLS] A [SEP] B [SEP]
- Parameters
token_ids_0 (List[int]) – List of ids for the first string sequence in the sequence pair (A).
token_ids_1 (List[int]) – List of ids for the second string sequence in the sequence pair (B).
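The effect of these special-token methods can be sketched in plain Python. Here cls and sep are placeholder ids standing in for the tokenizer’s actual [CLS] and [SEP] ids, which depend on the vocabulary file:

```python
# Plain-Python sketch of BERT-style special-token insertion; cls/sep are
# placeholder ids, not values from a real vocabulary.
cls, sep = 12, 13

def add_special_tokens_ids_single_sequence(token_ids):
    # [CLS] X [SEP]
    return [cls] + token_ids + [sep]

def add_special_tokens_ids_sequence_pair(token_ids_0, token_ids_1):
    # [CLS] A [SEP] B [SEP]
    return [cls] + token_ids_0 + [sep] + token_ids_1 + [sep]

print(add_special_tokens_ids_single_sequence([16, 16]))
print(add_special_tokens_ids_sequence_pair([16], [19, 18]))
```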
-
add_padding_tokens(token_ids: List[int], length: int, right: bool = True) → List[int][source]¶ Adds padding tokens to return a sequence of the specified length. By default, padding tokens are added to the right of the sequence.
- Parameters
token_ids (list[int]) – list of tokenized input ids. Can be obtained using the encode or encode_plus methods.
length (int) – Desired total length of the padded sequence.
right (bool (True by default)) – If True, padding tokens are added to the right of the sequence; otherwise they are added to the left.
- Returns
token_ids – List of tokenized input ids with padding tokens added to reach the specified length.
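The padding behaviour can be sketched in plain Python; pad is a placeholder id standing in for the tokenizer’s real padding token id:

```python
# Plain-Python sketch of right/left padding; pad is a placeholder id,
# not the tokenizer's real [PAD] token id.
pad = 0

def add_padding_tokens(token_ids, length, right=True):
    # Append (or prepend) enough padding ids to reach the target length.
    padding = [pad] * (length - len(token_ids))
    return token_ids + padding if right else padding + token_ids

print(add_padding_tokens([12, 16, 13], 5))               # pad on the right
print(add_padding_tokens([12, 16, 13], 5, right=False))  # pad on the left
```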
-
save_vocabulary(vocab_path: str)[source]¶ Saves the tokenizer vocabulary to a file.
- Parameters
vocab_path (str) – The directory in which to save the SMILES character-per-line vocabulary file. The default vocab file is found in deepchem/feat/tests/data/vocab.txt
- Returns
vocab_file – Tuple containing the path(s) to the saved SMILES character-per-line vocabulary file.
- Return type
Tuple(str)
BasicSmilesTokenizer¶
The dc.feat.BasicSmilesTokenizer
module uses a regex tokenization pattern to tokenize SMILES strings. The regex was developed by Schwaller et al. This tokenizer is intended for use on SMILES strings when the user does not wish to rely on the transformers API.
References: - Molecular Transformer: Unsupervised Attention-Guided Atom-Mapping
-
class BasicSmilesTokenizer(regex_pattern: str = '(\\[[^\\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\\(|\\)|\\.|=|#|-|\\+|\\\\|\\/|:|~|@|\\?|>>?|\\*|\\$|\\%[0-9]{2}|[0-9])')[source]¶ Runs basic SMILES tokenization using a regex pattern developed by Schwaller et al. Use this tokenizer when you need one that does not depend on HuggingFace’s transformers library.
Examples
>>> from deepchem.feat.smiles_tokenizer import BasicSmilesTokenizer
>>> tokenizer = BasicSmilesTokenizer()
>>> print(tokenizer.tokenize("CC(=O)OC1=CC=CC=C1C(=O)O"))
['C', 'C', '(', '=', 'O', ')', 'O', 'C', '1', '=', 'C', 'C', '=', 'C', 'C', '=', 'C', '1', 'C', '(', '=', 'O', ')', 'O']
References
- 1
Philippe Schwaller, Teodoro Laino, Théophile Gaudin, Peter Bolgar, Christopher A. Hunter, Costas Bekas, and Alpha A. Lee. Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction. ACS Central Science 2019, 5 (9), 1572-1583. DOI: 10.1021/acscentsci.9b00576