mmlearn.datasets.processors.tokenizers
Tokenizers - modules that convert raw input to sequences of tokens.
Classes
HFTokenizer | A wrapper for loading HuggingFace tokenizers.
Img2Seq | Convert a batch of images to a batch of sequences.
- class HFTokenizer(model_name_or_path, max_length=None, padding=False, truncation=None, **kwargs)
A wrapper for loading HuggingFace tokenizers.
This class wraps any HuggingFace tokenizer that can be initialized with transformers.AutoTokenizer.from_pretrained(). It preprocesses the input text and returns a dictionary with the tokenized text and other relevant information, such as attention masks.
- Parameters:
model_name_or_path (str) – Pretrained model name or path, same as in transformers.AutoTokenizer.from_pretrained().
max_length (Optional[int], optional, default=None) – Maximum length of the tokenized sequence. This is passed to the tokenizer __call__() method.
padding (bool or str, default=False) – Padding strategy. Accepts the same values as the padding argument of HuggingFace tokenizers; passed to the tokenizer __call__() method.
truncation (Optional[Union[bool, str]], optional, default=None) – Truncation strategy. Accepts the same values as the truncation argument of HuggingFace tokenizers; passed to the tokenizer __call__() method.
**kwargs (Any) – Additional arguments passed to transformers.AutoTokenizer.from_pretrained().
- __call__(sentence, **kwargs)
Tokenize a text or a list of texts using the HuggingFace tokenizer.
- Parameters:
sentence (Union[str, list[str]]) – Sentence(s) to be tokenized.
**kwargs (Any) – Additional arguments passed to the tokenizer __call__() method.
- Returns:
Tokenized sentence(s).
- Return type:
Notes
The input_ids key is replaced with Modalities.TEXT for consistency.
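The behaviour described above (call-time options stored at construction, then input_ids renamed in the output) can be illustrated with a dependency-free sketch. This is not the mmlearn implementation: TokenizerWrapper, toy_tokenizer and TEXT_KEY are hypothetical names, the string "text" is only a placeholder for whatever key Modalities.TEXT resolves to, and letting per-call arguments override the stored ones is an assumption of this sketch.

```python
TEXT_KEY = "text"  # hypothetical placeholder for the key derived from Modalities.TEXT


def toy_tokenizer(sentence, **kwargs):
    """Whitespace tokenizer that mimics the dict layout of HuggingFace output."""
    texts = [sentence] if isinstance(sentence, str) else list(sentence)
    vocab = {}
    # Assign ids in order of first appearance; 0 is reserved for padding.
    ids = [[vocab.setdefault(w, len(vocab) + 1) for w in t.split()] for t in texts]
    max_length = kwargs.get("max_length")
    if kwargs.get("truncation") and max_length is not None:
        ids = [seq[:max_length] for seq in ids]
    masks = [[1] * len(seq) for seq in ids]
    if kwargs.get("padding") == "max_length" and max_length is not None:
        masks = [m + [0] * (max_length - len(m)) for m in masks]
        ids = [seq + [0] * (max_length - len(seq)) for seq in ids]
    return {"input_ids": ids, "attention_mask": masks}


class TokenizerWrapper:
    """Stores call-time options once and renames input_ids on output."""

    def __init__(self, tokenizer, max_length=None, padding=False, truncation=None):
        self.tokenizer = tokenizer
        self.options = {
            "max_length": max_length,
            "padding": padding,
            "truncation": truncation,
        }

    def __call__(self, sentence, **kwargs):
        # Assumption of this sketch: per-call kwargs override stored options.
        encoded = dict(self.tokenizer(sentence, **{**self.options, **kwargs}))
        encoded[TEXT_KEY] = encoded.pop("input_ids")
        return encoded


wrapper = TokenizerWrapper(
    toy_tokenizer, max_length=4, padding="max_length", truncation=True
)
out = wrapper(["a photo of a cat", "a dog"])
print(sorted(out))    # ['attention_mask', 'text']
print(out[TEXT_KEY])  # [[1, 2, 3, 1], [1, 5, 0, 0]]
```

With transformers installed, a tokenizer returned by transformers.AutoTokenizer.from_pretrained() could be passed in place of toy_tokenizer, since it is also callable and returns a dict-like object containing input_ids.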
- class Img2Seq(img_size, patch_size, n_channels, d_model)
Convert a batch of images to a batch of sequences.
- Parameters:
- __call__(batch)
Convert a batch of images to a batch of sequences.
- Parameters:
batch (torch.Tensor) – Batch of images of shape (b, h, w, c), where b is the batch size, h is the height, w is the width, and c is the number of channels.
- Returns:
Batch of sequences of shape (b, s, d), where b is the batch size, s is the sequence length, and d is the dimension of the output sequence.
- Return type:
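To make the (b, h, w, c) to (b, s, d) shape change concrete, here is a minimal pure-Python sketch of one common way to turn images into sequences: split each image into non-overlapping patches, flatten each patch, and apply a linear projection. It assumes s is the number of patches and d equals d_model; the page does not confirm this, and the actual Img2Seq class may differ (for example by adding extra tokens or positional embeddings). The helper names patchify, project and img2seq are hypothetical, and the fixed weight matrix stands in for learned parameters.

```python
def patchify(image, patch_size):
    """Split one (h, w, c) image, given as nested lists, into flat patches.

    Returns (h // ph) * (w // pw) vectors, each of length ph * pw * c.
    """
    ph, pw = patch_size
    h, w = len(image), len(image[0])
    if h % ph or w % pw:
        raise ValueError("image size must be divisible by patch size")
    patches = []
    for top in range(0, h, ph):
        for left in range(0, w, pw):
            flat = []
            for row in image[top:top + ph]:
                for pixel in row[left:left + pw]:
                    flat.extend(pixel)  # append all c channel values
            patches.append(flat)
    return patches


def project(vector, weight):
    """Apply a linear map; weight has shape (len(vector), d)."""
    d = len(weight[0])
    return [sum(v * weight[i][j] for i, v in enumerate(vector)) for j in range(d)]


def img2seq(batch, patch_size, weight):
    """Map a (b, h, w, c) batch to a (b, s, d) batch of patch embeddings."""
    return [
        [project(patch, weight) for patch in patchify(image, patch_size)]
        for image in batch
    ]


# Toy input: b=2 images of size 4x4 with c=3 channels, all ones.
b, h, w, c = 2, 4, 4, 3
batch = [[[[1.0] * c for _ in range(w)] for _ in range(h)] for _ in range(b)]

# Fixed stand-in for a learned projection: (2 * 2 * 3) inputs -> d=5 outputs.
patch_size, d = (2, 2), 5
weight = [[0.1] * d for _ in range(patch_size[0] * patch_size[1] * c)]

seqs = img2seq(batch, patch_size, weight)
shape = (len(seqs), len(seqs[0]), len(seqs[0][0]))
print(shape)  # (2, 4, 5)
```

Under these assumptions the sequence length is s = (h / patch height) * (w / patch width), so a 4x4 image with 2x2 patches yields 4 tokens per image.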