mmlearn.datasets.processors.tokenizers.HFTokenizer
- class HFTokenizer(model_name_or_path, max_length=None, padding=False, truncation=None, **kwargs)[source]
- Bases: object

A wrapper for loading HuggingFace tokenizers.

This class wraps any HuggingFace tokenizer that can be initialized with transformers.AutoTokenizer.from_pretrained(). It preprocesses the input text and returns a dictionary containing the tokenized text and other relevant information, such as the attention mask.

- Parameters:
  - model_name_or_path (str) – Pretrained model name or path; same as in transformers.AutoTokenizer.from_pretrained().
  - max_length (Optional[int], optional, default=None) – Maximum length of the tokenized sequence; passed to the tokenizer's __call__() method.
  - padding (bool or str, default=False) – Padding strategy; passed to the tokenizer's __call__() method.
  - truncation (Optional[Union[bool, str]], optional, default=None) – Truncation strategy; passed to the tokenizer's __call__() method.
  - **kwargs (Any) – Additional arguments passed to transformers.AutoTokenizer.from_pretrained().
Methods

- __call__(sentence, **kwargs)[source]

Tokenize a text or a list of texts using the HuggingFace tokenizer.

- Parameters:
  - sentence (Union[str, list[str]]) – Sentence(s) to be tokenized.
  - **kwargs (Any) – Additional arguments passed to the tokenizer's __call__() method.
- Returns:
  Tokenized sentence(s).
- Return type:

Notes

The input_ids key is replaced with Modalities.TEXT for consistency.
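The wrapper pattern described above (tokenize, then rename the input_ids key) can be sketched in plain Python. This is an illustrative stand-in, not mmlearn's actual code: StubTokenizer, WrapperSketch, and the "text" key are hypothetical substitutes for a transformers tokenizer, the HFTokenizer class, and the Modalities.TEXT constant, respectively.

```python
from typing import Union

# Hypothetical stand-in for mmlearn's Modalities.TEXT key.
TEXT_KEY = "text"


class StubTokenizer:
    """Toy whitespace tokenizer mimicking a HuggingFace tokenizer's
    __call__ interface (a stand-in for AutoTokenizer.from_pretrained())."""

    def __call__(self, sentence, max_length=None, padding=False,
                 truncation=None, **kwargs):
        sentences = [sentence] if isinstance(sentence, str) else sentence
        ids = [[hash(tok) % 1000 for tok in s.split()] for s in sentences]
        return {
            "input_ids": ids,
            "attention_mask": [[1] * len(row) for row in ids],
        }


class WrapperSketch:
    """Minimal sketch of the HFTokenizer wrapper: store the tokenizer
    options at init, forward them on each call, then rename the
    'input_ids' key for modality consistency."""

    def __init__(self, tokenizer, max_length=None, padding=False,
                 truncation=None):
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.padding = padding
        self.truncation = truncation

    def __call__(self, sentence: Union[str, list], **kwargs) -> dict:
        out = self.tokenizer(
            sentence,
            max_length=self.max_length,
            padding=self.padding,
            truncation=self.truncation,
            **kwargs,
        )
        # As in the Notes above: replace 'input_ids' with the text key.
        out[TEXT_KEY] = out.pop("input_ids")
        return out


tokenizer = WrapperSketch(StubTokenizer())
result = tokenizer("hello world")
print(sorted(result))  # -> ['attention_mask', 'text']
```

With the real class, construction would instead look like HFTokenizer("bert-base-uncased", max_length=77, padding="max_length", truncation=True), and the returned dictionary would hold tensors keyed by Modalities.TEXT.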