atomgen.data.tokenizer module#

Tokenization module for atom modeling.

class AtomTokenizer(vocab_file, pad_token='<pad>', mask_token='<mask>', bos_token='<bos>', eos_token='<eos>', cls_token='<graph>', **kwargs)[source]#

Bases: PreTrainedTokenizer

Tokenizer for atomistic data.

Args:

vocab_file: The path to the vocabulary file.
pad_token: The padding token.
mask_token: The mask token.
bos_token: The beginning-of-system token.
eos_token: The end-of-system token.
cls_token: The classification token.
kwargs: Additional keyword arguments.
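
A minimal instantiation sketch (the vocabulary path below is hypothetical; per load_vocab, the file is a JSON mapping of tokens to integer ids):

>>> from atomgen.data.tokenizer import AtomTokenizer
>>> tokenizer = AtomTokenizer(vocab_file="vocab.json")
>>> tokenizer.pad_token
'<pad>'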

build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]#

Build model inputs from a sequence (or a pair of sequences) of token ids by adding the special tokens.

Return type:

List[int]
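
An illustrative call, assuming the usual Hugging Face semantics of wrapping a single sequence of token ids with the special tokens; the exact placement depends on the implementation:

>>> ids = tokenizer.convert_tokens_to_ids(["H", "H", "O"])
>>> input_ids = tokenizer.build_inputs_with_special_tokens(ids)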

convert_tokens_to_string(tokens)[source]#

Convert the list of chemical symbol tokens to a concatenated string.

Return type:

str
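
For example, assuming plain concatenation of the chemical symbols as the description suggests:

>>> tokenizer.convert_tokens_to_string(["H", "H", "O"])
'HHO'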

classmethod from_pretrained(*inputs, **kwargs)[source]#

Load the tokenizer from a pretrained model.

Return type:

Any
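
For example (the checkpoint path is hypothetical):

>>> tokenizer = AtomTokenizer.from_pretrained("path/to/pretrained-tokenizer")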

get_vocab()[source]#

Get the vocabulary.

Return type:

Dict[str, int]

get_vocab_size()[source]#

Get the size of the vocabulary.

Return type:

int
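
One would expect the two vocabulary accessors to agree:

>>> vocab = tokenizer.get_vocab()
>>> tokenizer.get_vocab_size() == len(vocab)
True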

static load_vocab(vocab_file)[source]#

Load the vocabulary from a JSON file.

Return type:

Dict[str, int]
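
A sketch with a hypothetical minimal vocabulary file:

>>> import json
>>> with open("vocab.json", "w") as f:
...     json.dump({"<pad>": 0, "H": 1, "O": 2}, f)
>>> AtomTokenizer.load_vocab("vocab.json")
{'<pad>': 0, 'H': 1, 'O': 2}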

pad(encoded_inputs, padding=True, max_length=None, pad_to_multiple_of=None, return_attention_mask=None, return_tensors=None, verbose=True)[source]#

Pad a batch of encoded inputs to the same length.

Return type:

BatchEncoding
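
A hedged sketch; the input_ids field name follows the usual Hugging Face convention and is an assumption here:

>>> batch = [{"input_ids": [5, 6]}, {"input_ids": [5, 6, 6, 5]}]
>>> padded = tokenizer.pad(batch, padding=True, return_tensors="pt")
>>> padded["input_ids"].shape
torch.Size([2, 4])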

pad_coords(batch, max_length=None, pad_to_multiple_of=None)[source]#

Pad the coordinates to the same length.

Return type:

Union[Mapping, List[Mapping]]

pad_fixed(batch, max_length=None, pad_to_multiple_of=None)[source]#

Pad the fixed mask to the same length.

Return type:

Union[Mapping, List[Mapping]]

pad_forces(batch, max_length=None, pad_to_multiple_of=None)[source]#

Pad the forces to the same length.

Return type:

Union[Mapping, List[Mapping]]
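
pad_coords, pad_fixed, and pad_forces follow the same pattern: each pads one per-atom field (coordinates, the fixed-atom mask, and forces, respectively) across a batch. An illustrative sketch for pad_coords; the coords key name is an assumption:

>>> batch = [
...     {"coords": [[0.0, 0.0, 0.0]]},
...     {"coords": [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0]]},
... ]
>>> padded = tokenizer.pad_coords(batch, max_length=4)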

save_vocabulary(save_directory, filename_prefix=None)[source]#

Save the vocabulary to a JSON file.

Return type:

Tuple[str]
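
For example; the exact filename is determined by the implementation, so the output shown here is an assumption:

>>> tokenizer.save_vocabulary("./saved_tokenizer")
('./saved_tokenizer/vocab.json',)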