mmlearn.datasets.processors.tokenizers

Tokenizers - modules that convert raw input to sequences of tokens.

Classes

HFTokenizer

A wrapper for loading HuggingFace tokenizers.

Img2Seq

Convert a batch of images to a batch of sequences.

class HFTokenizer(model_name_or_path, max_length=None, padding=False, truncation=None, **kwargs)[source]

A wrapper for loading HuggingFace tokenizers.

This class wraps any HuggingFace tokenizer that can be initialized with transformers.AutoTokenizer.from_pretrained(). It preprocesses the input text and returns a dictionary containing the tokenized text along with other relevant information, such as the attention mask.

Parameters:
  • model_name_or_path (str) – Pretrained model name or path - same as in transformers.AutoTokenizer.from_pretrained().

  • max_length (Optional[int], optional, default=None) – Maximum length of the tokenized sequence. This is passed to the tokenizer __call__() method.

  • padding (bool or str, default=False) – Padding strategy; passed to the tokenizer __call__() method.

  • truncation (Optional[Union[bool, str]], optional, default=None) – Truncation strategy; passed to the tokenizer __call__() method.

  • **kwargs (Any) – Additional arguments passed to transformers.AutoTokenizer.from_pretrained().

__call__(sentence, **kwargs)[source]

Tokenize a text or a list of texts using the HuggingFace tokenizer.

Parameters:
  • sentence (Union[str, list[str]]) – Sentence(s) to be tokenized.

  • **kwargs (Any) – Additional arguments passed to the tokenizer __call__() method.

Returns:

Tokenized sentence(s).

Return type:

dict[str, torch.Tensor]

Notes

In the returned dictionary, the input_ids key is replaced with Modalities.TEXT for consistency across modalities.
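The behavior described above can be sketched without downloading a real model. The stub tokenizer and the tokenize helper below are hypothetical stand-ins (the real class loads a tokenizer via transformers.AutoTokenizer.from_pretrained() and returns torch tensors); the sketch only illustrates the key steps: tokenize with max_length/padding/truncation, then rename input_ids to the text-modality key.

```python
class StubTokenizer:
    """Hypothetical stand-in for a HuggingFace tokenizer: maps
    whitespace-separated words to integer ids."""

    def __init__(self):
        self.vocab = {}

    def __call__(self, text, max_length=None, padding=False, truncation=None):
        ids = [self.vocab.setdefault(w, len(self.vocab)) for w in text.split()]
        if truncation and max_length is not None:
            ids = ids[:max_length]  # truncate to max_length
        mask = [1] * len(ids)       # real tokens get attention 1
        if padding == "max_length" and max_length is not None:
            pad = max_length - len(ids)
            ids = ids + [0] * pad   # pad with id 0
            mask = mask + [0] * pad # padded positions are masked out
        return {"input_ids": ids, "attention_mask": mask}


def tokenize(tokenizer, sentence, **kwargs):
    """Mimic HFTokenizer.__call__: tokenize, then rename input_ids to a
    modality key ("text" here stands in for Modalities.TEXT)."""
    out = tokenizer(sentence, **kwargs)
    out["text"] = out.pop("input_ids")
    return out


tok = StubTokenizer()
result = tokenize(tok, "hello world hello", max_length=5,
                  padding="max_length", truncation=True)
# result has keys "text" and "attention_mask"; "input_ids" is gone.
```

With a real HFTokenizer, the values would be torch.Tensor objects rather than Python lists, but the renaming of input_ids is the same.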

class Img2Seq(img_size, patch_size, n_channels, d_model)[source]

Convert a batch of images to a batch of sequences.

Parameters:
  • img_size (tuple of int) – The size of the input image.

  • patch_size (tuple of int) – The size of the patch.

  • n_channels (int) – The number of channels in the input image.

  • d_model (int) – The dimension of the output sequence.

__call__(batch)[source]

Convert a batch of images to a batch of sequences.

Parameters:

batch (torch.Tensor) – Batch of images of shape (b, h, w, c) where b is the batch size, h is the height, w is the width, and c is the number of channels.

Returns:

Batch of sequences of shape (b, s, d) where b is the batch size, s is the sequence length, and d is the dimension of the output sequence.

Return type:

torch.Tensor
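The shape transformation performed by Img2Seq can be sketched in NumPy, assuming ViT-style patch embedding (split each image into non-overlapping patches, flatten each patch, and project it linearly to d_model). The function, weights, and shapes below are illustrative assumptions, not the actual implementation, which operates on torch tensors with learned parameters.

```python
import numpy as np

def img2seq(batch, patch_size, d_model, weight, bias):
    """Sketch of the (b, h, w, c) -> (b, s, d) transformation."""
    b, h, w, c = batch.shape
    ph, pw = patch_size
    # Split into non-overlapping patches: (b, h//ph, ph, w//pw, pw, c)
    patches = batch.reshape(b, h // ph, ph, w // pw, pw, c)
    # Group the patch grid together: (b, h//ph, w//pw, ph, pw, c)
    patches = patches.transpose(0, 1, 3, 2, 4, 5)
    s = (h // ph) * (w // pw)  # sequence length = number of patches
    flat = patches.reshape(b, s, ph * pw * c)  # flatten each patch
    return flat @ weight + bias  # linear projection to (b, s, d_model)

rng = np.random.default_rng(0)
img_size, patch_size, n_channels, d_model = (32, 32), (8, 8), 3, 64
# Hypothetical projection parameters; the real module learns these.
weight = rng.standard_normal((patch_size[0] * patch_size[1] * n_channels,
                              d_model))
bias = np.zeros(d_model)
batch = rng.standard_normal((4, *img_size, n_channels))
seq = img2seq(batch, patch_size, d_model, weight, bias)
# 4 images, (32/8) * (32/8) = 16 patches each, projected to 64 dims.
```

Note that the sequence length s is determined by img_size and patch_size: each image yields (h // ph) * (w // pw) patches.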