mmlearn.datasets.processors.tokenizers

Tokenizers - modules that convert raw input to sequences of tokens.

Classes

HFTokenizer

A wrapper for loading HuggingFace tokenizers.

Img2Seq

Convert a batch of images to a batch of sequences.

class HFTokenizer(model_name_or_path, max_length=None, padding=False, truncation=None, **kwargs)[source]

A wrapper for loading HuggingFace tokenizers.

This class wraps any HuggingFace tokenizer that can be initialized with transformers.AutoTokenizer.from_pretrained(). It preprocesses the input text and returns a dictionary containing the tokenized text along with other relevant information, such as the attention mask.

Parameters:
  • model_name_or_path (str) – Pretrained model name or path - same as in transformers.AutoTokenizer.from_pretrained().

  • max_length (Optional[int], optional, default=None) – Maximum length of the tokenized sequence. This is passed to the tokenizer __call__() method.

  • padding (bool or str, default=False) – Padding strategy; passed to the tokenizer __call__() method.

  • truncation (Optional[Union[bool, str]], optional, default=None) – Truncation strategy; passed to the tokenizer __call__() method.

  • **kwargs (Any) – Additional arguments passed to transformers.AutoTokenizer.from_pretrained().

__call__(sentence, **kwargs)[source]

Tokenize a text or a list of texts using the HuggingFace tokenizer.

Parameters:
  • sentence (Union[str, list[str]]) – Sentence(s) to be tokenized.

  • **kwargs (Any) – Additional arguments passed to the tokenizer __call__() method.

Returns:

Tokenized sentence(s).

Return type:

dict[str, torch.Tensor]

Notes

In the returned dictionary, the input_ids key is replaced with Modalities.TEXT for consistency across modalities.
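The behavior described above can be sketched without downloading a real model. The stub tokenizer and the tokenize helper below are hypothetical stand-ins (the real class loads a tokenizer via transformers.AutoTokenizer.from_pretrained() and returns torch tensors); the sketch only illustrates the key steps: tokenize with max_length/padding/truncation, then rename input_ids to the text-modality key.

```python
class StubTokenizer:
    """Hypothetical stand-in for a HuggingFace tokenizer: maps
    whitespace-separated words to integer ids."""

    def __init__(self):
        self.vocab = {}

    def __call__(self, text, max_length=None, padding=False, truncation=None):
        ids = [self.vocab.setdefault(w, len(self.vocab)) for w in text.split()]
        if truncation and max_length is not None:
            ids = ids[:max_length]  # truncate to max_length
        mask = [1] * len(ids)       # real tokens get attention 1
        if padding == "max_length" and max_length is not None:
            pad = max_length - len(ids)
            ids = ids + [0] * pad   # pad with id 0
            mask = mask + [0] * pad # padded positions are masked out
        return {"input_ids": ids, "attention_mask": mask}


def tokenize(tokenizer, sentence, **kwargs):
    """Mimic HFTokenizer.__call__: tokenize, then rename input_ids to a
    modality key ("text" here stands in for Modalities.TEXT)."""
    out = tokenizer(sentence, **kwargs)
    out["text"] = out.pop("input_ids")
    return out


tok = StubTokenizer()
result = tokenize(tok, "hello world hello", max_length=5,
                  padding="max_length", truncation=True)
# result has keys "text" and "attention_mask"; "input_ids" is gone.
```

With a real HFTokenizer, the values would be torch.Tensor objects rather than Python lists, but the renaming of input_ids is the same.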

class Img2Seq(img_size, patch_size, n_channels, d_model)[source]

Convert a batch of images to a batch of sequences.

Parameters:
  • img_size (tuple of int) – The size of the input image.

  • patch_size (tuple of int) – The size of the patch.

  • n_channels (int) – The number of channels in the input image.

  • d_model (int) – The dimension of the output sequence.

__call__(batch)[source]

Convert a batch of images to a batch of sequences.

Parameters:

batch (torch.Tensor) – Batch of images of shape (b, h, w, c) where b is the batch size, h is the height, w is the width, and c is the number of channels.

Returns:

Batch of sequences of shape (b, s, d) where b is the batch size, s is the sequence length, and d is the dimension of the output sequence.

Return type:

torch.Tensor
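The shape transformation performed by Img2Seq can be sketched in NumPy, assuming ViT-style patch embedding (split each image into non-overlapping patches, flatten each patch, and project it linearly to d_model). The function, weights, and shapes below are illustrative assumptions, not the actual implementation, which operates on torch tensors with learned parameters.

```python
import numpy as np

def img2seq(batch, patch_size, d_model, weight, bias):
    """Sketch of the (b, h, w, c) -> (b, s, d) transformation."""
    b, h, w, c = batch.shape
    ph, pw = patch_size
    # Split into non-overlapping patches: (b, h//ph, ph, w//pw, pw, c)
    patches = batch.reshape(b, h // ph, ph, w // pw, pw, c)
    # Group the patch grid together: (b, h//ph, w//pw, ph, pw, c)
    patches = patches.transpose(0, 1, 3, 2, 4, 5)
    s = (h // ph) * (w // pw)  # sequence length = number of patches
    flat = patches.reshape(b, s, ph * pw * c)  # flatten each patch
    return flat @ weight + bias  # linear projection to (b, s, d_model)

rng = np.random.default_rng(0)
img_size, patch_size, n_channels, d_model = (32, 32), (8, 8), 3, 64
# Hypothetical projection parameters; the real module learns these.
weight = rng.standard_normal((patch_size[0] * patch_size[1] * n_channels,
                              d_model))
bias = np.zeros(d_model)
batch = rng.standard_normal((4, *img_size, n_channels))
seq = img2seq(batch, patch_size, d_model, weight, bias)
# 4 images, (32/8) * (32/8) = 16 patches each, projected to 64 dims.
```

Note that the sequence length s is determined by img_size and patch_size: each image yields (h // ph) * (w // pw) patches.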