mmlearn.datasets.processors.tokenizers.HFTokenizer

class HFTokenizer(model_name_or_path, max_length=None, padding=False, truncation=None, **kwargs)[source]

Bases: object

A wrapper for loading HuggingFace tokenizers.

This class wraps any HuggingFace tokenizer that can be initialized with transformers.AutoTokenizer.from_pretrained(). It preprocesses input text and returns a dictionary containing the tokenized text along with other relevant outputs, such as attention masks.

Parameters:
  • model_name_or_path (str) – Pretrained model name or path - same as in transformers.AutoTokenizer.from_pretrained().

  • max_length (Optional[int], optional, default=None) – Maximum length of the tokenized sequence. This is passed to the tokenizer __call__() method.

  • padding (bool or str, default=False) – Padding strategy; passed to the tokenizer __call__() method. Accepts the same values as transformers.PreTrainedTokenizer.__call__().

  • truncation (Optional[Union[bool, str]], optional, default=None) – Truncation strategy; passed to the tokenizer __call__() method. Accepts the same values as transformers.PreTrainedTokenizer.__call__().

  • **kwargs (Any) – Additional arguments passed to transformers.AutoTokenizer.from_pretrained().
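A typical construction might look like the sketch below. The checkpoint name "bert-base-uncased" is only an illustrative choice, not one mandated by mmlearn, and the import is guarded so the sketch degrades gracefully when the optional dependencies (mmlearn, transformers) are unavailable.

```python
# Hedged sketch: construct the wrapper with common keyword arguments.
# "bert-base-uncased" is an assumed example checkpoint.
try:
    from mmlearn.datasets.processors.tokenizers import HFTokenizer

    tokenizer = HFTokenizer(
        "bert-base-uncased",   # forwarded to AutoTokenizer.from_pretrained()
        max_length=77,         # forwarded to the tokenizer __call__() method
        padding="max_length",  # pad every sequence to max_length
        truncation=True,       # truncate sequences longer than max_length
    )
except Exception:              # missing packages or no network access
    tokenizer = None
```

The keyword arguments mirror the parameter list above: max_length, padding, and truncation are stored and later forwarded to the underlying tokenizer's __call__() method, while any extra **kwargs go to from_pretrained().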

Methods

__call__(sentence, **kwargs)[source]

Tokenize a text or a list of texts using the HuggingFace tokenizer.

Parameters:
  • sentence (Union[str, list[str]]) – Sentence(s) to be tokenized.

  • **kwargs (Any) – Additional arguments passed to the tokenizer __call__() method.

Returns:

Tokenized sentence(s).

Return type:

dict[str, torch.Tensor]

Notes

In the returned dictionary, the input_ids key is renamed to Modalities.TEXT for consistency with other modality-specific keys.
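The key replacement described in the Notes can be sketched with a stub encoding. The dictionary shape mimics a HuggingFace BatchEncoding, and the string "text" is only a stand-in assumption for the actual value of Modalities.TEXT.

```python
# Sketch of the key-renaming step, using a stub tokenizer output instead
# of a real HuggingFace tokenizer. "text" is an assumed stand-in for
# Modalities.TEXT; the real constant may differ.
TEXT_KEY = "text"

def rename_input_ids(encoding: dict) -> dict:
    """Replace the 'input_ids' key with the modality key, keeping other entries."""
    out = dict(encoding)
    out[TEXT_KEY] = out.pop("input_ids")
    return out

stub = {"input_ids": [101, 2023, 102], "attention_mask": [1, 1, 1]}
renamed = rename_input_ids(stub)
# renamed now holds the token ids under "text"; "input_ids" is gone,
# and "attention_mask" is preserved unchanged
```

In the real wrapper the values are torch.Tensor objects rather than plain lists, but the renaming logic is the same.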