atomgen.data.tokenizer module#
Tokenization module for atom modeling.
- class AtomTokenizer(vocab_file, pad_token='<pad>', mask_token='<mask>', bos_token='<bos>', eos_token='<eos>', cls_token='<graph>', **kwargs)[source]#
Bases:
PreTrainedTokenizer
Tokenizer for atomistic data.
- Args:
  - vocab_file: The path to the vocabulary file.
  - pad_token: The padding token.
  - mask_token: The mask token.
  - bos_token: The beginning-of-system token.
  - eos_token: The end-of-system token.
  - cls_token: The classification token.
  - kwargs: Additional keyword arguments.
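As a hedged illustration of how such a tokenizer is typically constructed, the sketch below builds a toy vocabulary file. The JSON symbol-to-id format and the specific ids are assumptions for illustration, not atomgen's actual vocabulary; with atomgen installed, the commented lines show the intended call.

```python
import json
import tempfile

# Hypothetical vocabulary: chemical-symbol tokens plus the special
# tokens listed above, mapped to integer ids. The JSON format and the
# id values are assumptions made for this sketch.
vocab = {"<pad>": 0, "<mask>": 1, "<bos>": 2, "<eos>": 3, "<graph>": 4,
         "H": 5, "C": 6, "N": 7, "O": 8}

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(vocab, f)
    vocab_file = f.name

# With atomgen available, construction would look like:
# from atomgen.data.tokenizer import AtomTokenizer
# tokenizer = AtomTokenizer(vocab_file=vocab_file)

# Minimal stand-in lookup illustrating the symbol -> id mapping:
ids = [vocab[t] for t in ["C", "H", "H", "H", "H"]]
print(ids)  # [6, 5, 5, 5, 5]
```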
- build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]#
Build the input with special tokens.
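In Hugging Face tokenizers this method usually wraps one (or two) id sequences in the tokenizer's special tokens. The sketch below shows that common pattern; the ids (bos=2, eos=3) are illustrative assumptions, not atomgen's actual values.

```python
# Assumed special-token ids for this sketch only.
BOS_ID, EOS_ID = 2, 3

def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None):
    # Wrap the first sequence in <bos> ... <eos>; append the optional
    # second sequence followed by another <eos>.
    out = [BOS_ID] + token_ids_0 + [EOS_ID]
    if token_ids_1 is not None:
        out = out + token_ids_1 + [EOS_ID]
    return out

print(build_inputs_with_special_tokens([6, 5, 5]))  # [2, 6, 5, 5, 3]
```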
- convert_tokens_to_string(tokens)[source]#
Convert the list of chemical symbol tokens to a concatenated string.
- Return type:
str
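Since the docstring describes concatenating chemical-symbol tokens into a single string, a minimal stand-in (assuming no separator between symbols) looks like:

```python
def convert_tokens_to_string(tokens):
    # Join chemical-symbol tokens into one formula-like string.
    # Joining with no separator is an assumption of this sketch.
    return "".join(tokens)

print(convert_tokens_to_string(["C", "H", "H", "H", "H"]))  # CHHHH
```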
- classmethod from_pretrained(*inputs, **kwargs)[source]#
Load the tokenizer from a pretrained model.
- Return type:
AtomTokenizer
- pad(encoded_inputs, padding=True, max_length=None, pad_to_multiple_of=None, return_attention_mask=None, return_tensors=None, verbose=True)[source]#
Pad the input data.
- Return type:
BatchEncoding
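The sketch below illustrates the behavior this kind of `pad` method provides: id sequences are padded to a common length (optionally rounded up to a multiple) and an attention mask marks real versus padded positions. The pad id of 0 and the returned plain dict are assumptions; the real method returns a `BatchEncoding`.

```python
PAD_ID = 0  # assumed id of the <pad> token in this sketch

def pad_batch(sequences, max_length=None, pad_to_multiple_of=None):
    # Pad every sequence to the target length and build an attention
    # mask (1 = real token, 0 = padding).
    length = max_length or max(len(s) for s in sequences)
    if pad_to_multiple_of:
        length = -(-length // pad_to_multiple_of) * pad_to_multiple_of
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = length - len(seq)
        input_ids.append(seq + [PAD_ID] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[6, 5, 5], [8]], pad_to_multiple_of=4)
print(batch["input_ids"])       # [[6, 5, 5, 0], [8, 0, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 0], [1, 0, 0, 0]]
```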
- pad_coords(batch, max_length=None, pad_to_multiple_of=None)[source]#
Pad the coordinates to the same length.
- pad_fixed(batch, max_length=None, pad_to_multiple_of=None)[source]#
Pad the fixed mask to the same length.
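A hedged sketch of what the two padding helpers above do: per-atom 3D coordinates are padded with zero vectors and the boolean fixed mask with `False`, so every system in the batch reaches the same length. Zero vectors and `False` as pad values are assumptions of this sketch.

```python
def pad_coords(batch, max_length=None):
    # Pad each list of [x, y, z] coordinates with zero vectors
    # (assumed pad value) up to the target length.
    length = max_length or max(len(c) for c in batch)
    return [c + [[0.0, 0.0, 0.0]] * (length - len(c)) for c in batch]

def pad_fixed(batch, max_length=None):
    # Pad each boolean fixed-atom mask with False (assumed pad value).
    length = max_length or max(len(m) for m in batch)
    return [m + [False] * (length - len(m)) for m in batch]

coords = pad_coords([[[0.0, 0.0, 0.0], [0.0, 0.0, 1.1]],
                     [[0.5, 0.5, 0.5]]])
print(coords[1])  # [[0.5, 0.5, 0.5], [0.0, 0.0, 0.0]]

fixed = pad_fixed([[True, False], [True]])
print(fixed[1])   # [True, False]
```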