atomgen.data.tokenizer module#
Tokenization module for atom modeling.
- class AtomTokenizer(vocab_file, pad_token='<pad>', mask_token='<mask>', bos_token='<bos>', eos_token='<eos>', cls_token='<graph>', **kwargs)[source]#
Bases:
PreTrainedTokenizer
Tokenizer for atomistic data.
- Args:
  - vocab_file: The path to the vocabulary file.
  - pad_token: The padding token.
  - mask_token: The mask token.
  - bos_token: The beginning-of-system token.
  - eos_token: The end-of-system token.
  - cls_token: The classification token.
  - kwargs: Additional keyword arguments.
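As a hedged illustration of how such a tokenizer is typically constructed, the sketch below builds a toy vocabulary file. The JSON symbol-to-id format and the specific ids are assumptions for illustration, not atomgen's actual vocabulary; with atomgen installed, the commented lines show the intended call.

```python
import json
import tempfile

# Hypothetical vocabulary: chemical-symbol tokens plus the special
# tokens listed above, mapped to integer ids. The JSON format and the
# id values are assumptions made for this sketch.
vocab = {"<pad>": 0, "<mask>": 1, "<bos>": 2, "<eos>": 3, "<graph>": 4,
         "H": 5, "C": 6, "N": 7, "O": 8}

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(vocab, f)
    vocab_file = f.name

# With atomgen available, construction would look like:
# from atomgen.data.tokenizer import AtomTokenizer
# tokenizer = AtomTokenizer(vocab_file=vocab_file)

# Minimal stand-in lookup illustrating the symbol -> id mapping:
ids = [vocab[t] for t in ["C", "H", "H", "H", "H"]]
print(ids)  # [6, 5, 5, 5, 5]
```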
- build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]#
Build the input with special tokens.
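In Hugging Face tokenizers this method usually wraps one (or two) id sequences in the tokenizer's special tokens. The sketch below shows that common pattern; the ids (bos=2, eos=3) are illustrative assumptions, not atomgen's actual values.

```python
# Assumed special-token ids for this sketch only.
BOS_ID, EOS_ID = 2, 3

def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None):
    # Wrap the first sequence in <bos> ... <eos>; append the optional
    # second sequence followed by another <eos>.
    out = [BOS_ID] + token_ids_0 + [EOS_ID]
    if token_ids_1 is not None:
        out = out + token_ids_1 + [EOS_ID]
    return out

print(build_inputs_with_special_tokens([6, 5, 5]))  # [2, 6, 5, 5, 3]
```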
- convert_tokens_to_string(tokens)[source]#
Convert the list of chemical symbol tokens to a concatenated string.
- Return type:
str
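Since the docstring describes concatenating chemical-symbol tokens into a single string, a minimal stand-in (assuming no separator between symbols) looks like:

```python
def convert_tokens_to_string(tokens):
    # Join chemical-symbol tokens into one formula-like string.
    # Joining with no separator is an assumption of this sketch.
    return "".join(tokens)

print(convert_tokens_to_string(["C", "H", "H", "H", "H"]))  # CHHHH
```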
- classmethod from_pretrained(*inputs, **kwargs)[source]#
Load the tokenizer from a pretrained model.
- Return type:
AtomTokenizer
- pad(encoded_inputs, padding=True, max_length=None, pad_to_multiple_of=None, return_attention_mask=None, return_tensors=None, verbose=True)[source]#
Pad the input data.
- Return type:
BatchEncoding
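The sketch below illustrates the behavior this kind of `pad` method provides: id sequences are padded to a common length (optionally rounded up to a multiple) and an attention mask marks real versus padded positions. The pad id of 0 and the returned plain dict are assumptions; the real method returns a `BatchEncoding`.

```python
PAD_ID = 0  # assumed id of the <pad> token in this sketch

def pad_batch(sequences, max_length=None, pad_to_multiple_of=None):
    # Pad every sequence to the target length and build an attention
    # mask (1 = real token, 0 = padding).
    length = max_length or max(len(s) for s in sequences)
    if pad_to_multiple_of:
        length = -(-length // pad_to_multiple_of) * pad_to_multiple_of
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = length - len(seq)
        input_ids.append(seq + [PAD_ID] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[6, 5, 5], [8]], pad_to_multiple_of=4)
print(batch["input_ids"])       # [[6, 5, 5, 0], [8, 0, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 0], [1, 0, 0, 0]]
```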
- pad_coords(batch, max_length=None, pad_to_multiple_of=None)[source]#
Pad the coordinates to the same length.
- pad_fixed(batch, max_length=None, pad_to_multiple_of=None)[source]#
Pad the fixed mask to the same length.
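A hedged sketch of what the two padding helpers above do: per-atom 3D coordinates are padded with zero vectors and the boolean fixed mask with `False`, so every system in the batch reaches the same length. Zero vectors and `False` as pad values are assumptions of this sketch.

```python
def pad_coords(batch, max_length=None):
    # Pad each list of [x, y, z] coordinates with zero vectors
    # (assumed pad value) up to the target length.
    length = max_length or max(len(c) for c in batch)
    return [c + [[0.0, 0.0, 0.0]] * (length - len(c)) for c in batch]

def pad_fixed(batch, max_length=None):
    # Pad each boolean fixed-atom mask with False (assumed pad value).
    length = max_length or max(len(m) for m in batch)
    return [m + [False] * (length - len(m)) for m in batch]

coords = pad_coords([[[0.0, 0.0, 0.0], [0.0, 0.0, 1.1]],
                     [[0.5, 0.5, 0.5]]])
print(coords[1])  # [[0.5, 0.5, 0.5], [0.0, 0.0, 0.0]]

fixed = pad_fixed([[True, False], [True]])
print(fixed[1])   # [True, False]
```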