Getting Started¶
mmlearn contains a collection of tools and utilities to help researchers and practitioners easily set up and run training or evaluation experiments for multimodal representation learning methods. The toolkit is designed to be modular and extensible. We aim to provide a high degree of flexibility in using existing methods, while also allowing users to easily add support for new modalities of data, datasets, models and pretraining or evaluation methods.
Much of the power and flexibility of mmlearn comes from building on top of the PyTorch Lightning framework and using Hydra and hydra-zen for configuration management. Together, these tools make it easy to define and run experiments with different configurations, and to scale up experiments to run on a SLURM cluster.
The goal of this guide is to give you a brief overview of what mmlearn is and how you can get started using it.
Note
mmlearn currently only supports training and evaluation of encoder-only models.
For more detailed information on the features and capabilities of mmlearn, please refer to the API Reference.
Defining a Dataset¶
Datasets in mmlearn can be defined using PyTorch’s Dataset or IterableDataset classes. However, there are two additional requirements for datasets in mmlearn:
1. The dataset must return an instance of Example from the __getitem__() method or the __iter__() method.
2. The Example object returned by the dataset must contain the key 'example_index' and use modality-specific keys from the Modalities registry to store the data.
Example 1: Defining a map-style dataset in mmlearn:
from torch.utils.data.dataset import Dataset

from mmlearn.datasets.core import Example, Modalities
from mmlearn.constants import EXAMPLE_INDEX_KEY


class MyMapStyleDataset(Dataset[Example]):
    ...

    def __getitem__(self, idx: int) -> Example:
        ...
        return Example(
            {
                EXAMPLE_INDEX_KEY: idx,
                Modalities.TEXT.name: ...,
                Modalities.RGB.name: ...,
                Modalities.RGB.target: ...,
                Modalities.TEXT.mask: ...,
                ...
            }
        )
Example 2: Defining an iterable-style dataset in mmlearn:
from collections.abc import Generator

from torch.utils.data.dataset import IterableDataset

from mmlearn.datasets.core import Example, Modalities
from mmlearn.constants import EXAMPLE_INDEX_KEY


class MyIterableStyleDataset(IterableDataset[Example]):
    ...

    def __iter__(self) -> Generator[Example, None, None]:
        ...
        # Iterable-style datasets have no index argument, so track one manually.
        idx = 0
        for item in items:
            yield Example(
                {
                    EXAMPLE_INDEX_KEY: idx,
                    Modalities.TEXT.name: ...,
                    Modalities.AUDIO.name: ...,
                    Modalities.TEXT.mask: ...,
                    Modalities.AUDIO.mask: ...,
                    ...
                }
            )
            idx += 1
The Example class represents a single example in the dataset and all the attributes associated with it. The class is an extension of the OrderedDict class that provides attribute-style access to the dictionary values and handles the creation of the 'example_ids' tuple, combining the 'example_index' and 'dataset_index' values.
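To make the behaviour described above concrete, here is a minimal standard-library sketch of a dictionary with attribute-style access and a combined identifier. ToyExample and its create_ids method are hypothetical stand-ins written for this guide, not mmlearn's Example class, and the ordering of the 'example_ids' tuple is an assumption.

```python
from collections import OrderedDict
from typing import Any, Optional


class ToyExample(OrderedDict):
    """Illustrative stand-in for an Example-like container (not mmlearn's code)."""

    def __init__(self, init_dict: Optional[dict] = None) -> None:
        super().__init__()
        for key, value in (init_dict or {}).items():
            self[key] = value

    def __getattr__(self, name: str) -> Any:
        # Only called when normal attribute lookup fails, so dict methods still work.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def create_ids(self) -> None:
        # Assumption: the identifier is ordered as (dataset_index, example_index).
        if "example_index" in self:
            self["example_ids"] = (self.get("dataset_index", 0), self["example_index"])


example = ToyExample({"example_index": 3, "dataset_index": 1, "text": "a caption"})
example.create_ids()
print(example.text, example.example_ids)
```

Attribute access is a convenience here; the underlying mapping is unchanged, so code that expects an ordinary dictionary continues to work.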
Modalities is an instance of the ModalityRegistry singleton class, which serves as a global registry for all the modalities supported by mmlearn. It allows dot-style access to registered modalities and their properties. For example, the 'RGB' modality can be accessed using Modalities.RGB (returns the string 'rgb') and the 'target' property of the 'RGB' modality can be accessed using Modalities.RGB.target (returns the string 'rgb_target'). It also provides a method to register new modalities and their properties. For example, the following code snippet shows how to register a new 'DNA' modality:
from mmlearn.datasets.core import Modalities
Modalities.register_modality("dna")
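For readers who want a mental model of how such a registry can work, the following is a rough sketch of a singleton with dot-style access. ToyModality and ToyModalityRegistry are hypothetical names invented for this illustration, and the default property list ("target", "mask") is an assumption; mmlearn's actual ModalityRegistry may differ in its details.

```python
class ToyModality:
    """One registered modality; property strings are derived from its name."""

    def __init__(self, name: str, properties: tuple = ("target", "mask")) -> None:
        self.name = name.lower()
        for prop in properties:
            # e.g. name "rgb" and property "target" give "rgb_target"
            setattr(self, prop, f"{self.name}_{prop}")

    def __str__(self) -> str:
        return self.name


class ToyModalityRegistry:
    """Singleton registry: every instantiation returns the same object."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._modalities = {}
        return cls._instance

    def register_modality(self, name: str) -> None:
        key = name.lower()
        if key in self._modalities:
            raise ValueError(f"Modality '{name}' is already registered.")
        self._modalities[key] = ToyModality(key)

    def __getattr__(self, name: str) -> ToyModality:
        # Dot-style access, case-insensitive: registry.DNA looks up "dna".
        try:
            return self.__dict__["_modalities"][name.lower()]
        except KeyError:
            raise AttributeError(name) from None


registry = ToyModalityRegistry()
registry.register_modality("dna")
print(registry.DNA.name, registry.DNA.target)
```

Because the registry is a singleton, a modality registered in one module is visible everywhere else the registry is used.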
Creating a Model¶
Models in mmlearn are generally defined by extending PyTorch’s nn.Module class. The input to the model’s forward method should be a dictionary, where the keys are the names of the modalities and the values are the corresponding (batched) tensors/data. The models must also return a list-like object where the first element is the last layer’s output.
import torch
from torch import nn

from mmlearn.datasets.core import Modalities


class MyTextEncoder(nn.Module):
    def __init__(self, input_dim: int, output_dim: int):
        super().__init__()
        self.encoder = ...

    def forward(self, inputs: dict[str, torch.Tensor]) -> tuple[torch.Tensor]:
        out = self.encoder(
            inputs[Modalities.TEXT.name],
            inputs.get(
                "attention_mask", inputs.get(Modalities.TEXT.attention_mask, None)
            ),
        )
        return (out,)
Passing a dictionary of the (batched) inputs to the model’s forward method makes it easier to reuse the same model for different tasks.
Creating and Configuring a Project¶
A project in mmlearn can be thought of as a collection of related experiments. Within a project, you can reuse components from mmlearn (e.g., datasets, models, tasks) or define new ones and use them all together for experiments.
To create a new project, create a new directory following the structure below:
my_project/
├── configs/
│   ├── __init__.py
│   └── experiment/
│       └── my_experiment.yaml
├── README.md (optional)
└── requirements.txt (optional)
The configs/ directory contains all the configurations, both structured configs and YAML config files, for the experiments in the project. The configs/experiment/ directory contains the .yaml files for the experiments associated with the project. These .yaml files use the Hydra configuration format, which also allows overriding the configuration options/values from the command line.
The __init__.py file in the configs/ directory is required to make the configs/ directory a Python package, allowing Hydra to compose configurations from .yaml files as well as structured configs from Python modules. More on this in the next section.
Optionally, you can also include a README.md file with a brief description of the project and a requirements.txt file with the dependencies required to run the project.
Specifying Configurable Components¶
One of the key features of the Hydra configuration system is the ability to compose configurations from multiple sources, including the command line, .yaml files and structured configs from Python modules. Structured Configs in Hydra use Python dataclasses to define the configuration schema. This allows for both static and runtime type-checking of the configuration. Hydra-zen extends Hydra to make it easy to dynamically generate dataclass-backed configurations for any class or function simply by adding a decorator to the class or function.
mmlearn provides a pre-populated config store, external_store, which can be used as a decorator to register configurable components. This config store already contains configurations for common components like PyTorch optimizers, learning rate schedulers, loss functions and samplers, as well as PyTorch Lightning’s Trainer callbacks and loggers.
To dynamically add new configurable components to the store, simply add the external_store decorator to the class or function definition.
For example, the following code snippet shows how to register a new dataset class:
from torch.utils.data.dataset import Dataset

from mmlearn.conf import external_store
from mmlearn.constants import EXAMPLE_INDEX_KEY
from mmlearn.datasets.core import Example, Modalities


@external_store(group="datasets")
class MyMapStyleDataset(Dataset[Example]):
    ...

    def __getitem__(self, idx: int) -> Example:
        ...
        return Example(
            {
                EXAMPLE_INDEX_KEY: idx,
                Modalities.TEXT.name: ...,
                Modalities.RGB.name: ...,
                Modalities.RGB.target: ...,
                Modalities.TEXT.mask: ...,
                ...
            }
        )
The external_store decorator adds the class to the config store as soon as the Python interpreter loads the module containing the class. This is why the configs/ directory must be a Python package and why modules containing user-defined configurable components must be imported in the configs/__init__.py file.
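The import-time behaviour described above follows from how Python decorators work in general: the decorator runs when the class statement is executed, which happens when the module is loaded. The sketch below shows this with a toy registry; TOY_STORE and toy_store are hypothetical names used only for illustration and are not how external_store is implemented.

```python
from typing import Callable

# Hypothetical stand-in for a config store: maps (group, name) to the registered class.
TOY_STORE: dict = {}


def toy_store(group: str) -> Callable[[type], type]:
    """Return a decorator that records the decorated class under the given group."""

    def decorator(cls: type) -> type:
        # Runs as soon as the class statement executes, i.e. at module import time.
        TOY_STORE[(group, cls.__name__)] = cls
        return cls  # the class itself is returned unchanged

    return decorator


@toy_store(group="datasets")
class MyMapStyleDataset:
    ...


# No instance was created, yet the class is already registered.
print(sorted(TOY_STORE))
```

If the module defining MyMapStyleDataset were never imported, the class statement would never run and the store would not know about it.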
The group argument specifies the config group under which the configurable component will be registered. This allows users to easily reference the component in the configurations using the group name and the class name. The available config groups in mmlearn are:
datasets: Contains all the dataset classes.
datasets/masking: Contains all the configurable classes and functions for masking input data.
datasets/tokenizers: Contains all the configurable classes and functions for converting raw inputs to tokens.
datasets/transforms: Contains all the configurable classes and functions for transforming input data.
dataloader/sampler: Contains all the dataloader sampler classes.
modules/encoders: Contains all the encoder modules.
modules/layers: For layers that can be used independent of the model.
modules/losses: Contains all the loss functions.
modules/optimizers: Contains all the optimizers.
modules/lr_schedulers: Contains all the learning rate schedulers.
modules/metrics: Contains all the evaluation metrics.
tasks: Contains all the task classes.
trainer/callbacks: Contains all the PyTorch Lightning Trainer callbacks.
trainer/logger: Contains all the PyTorch Lightning Trainer loggers.
The Base Configuration¶
The base configuration for all experiments in mmlearn is defined in the MMLearnConf dataclass. It can be extended to include additional configuration options, following Hydra’s override syntax.
The base configuration for mmlearn is shown below:
experiment_name: ???
job_type: train
seed: null
datasets:
  train: null
  val: null
  test: null
dataloader:
  train:
    _target_: torch.utils.data.dataloader.DataLoader
    _convert_: object
    dataset: ???
    batch_size: 1
    shuffle: null
    sampler: null
    batch_sampler: null
    num_workers: 0
    collate_fn:
      _target_: mmlearn.datasets.core.data_collator.DefaultDataCollator
      batch_processors: null
    pin_memory: true
    drop_last: false
    timeout: 0.0
    worker_init_fn: null
    multiprocessing_context: null
    generator: null
    prefetch_factor: null
    persistent_workers: false
    pin_memory_device: ''
  val:
    _target_: torch.utils.data.dataloader.DataLoader
    _convert_: object
    dataset: ???
    batch_size: 1
    shuffle: null
    sampler: null
    batch_sampler: null
    num_workers: 0
    collate_fn:
      _target_: mmlearn.datasets.core.data_collator.DefaultDataCollator
      batch_processors: null
    pin_memory: true
    drop_last: false
    timeout: 0.0
    worker_init_fn: null
    multiprocessing_context: null
    generator: null
    prefetch_factor: null
    persistent_workers: false
    pin_memory_device: ''
  test:
    _target_: torch.utils.data.dataloader.DataLoader
    _convert_: object
    dataset: ???
    batch_size: 1
    shuffle: null
    sampler: null
    batch_sampler: null
    num_workers: 0
    collate_fn:
      _target_: mmlearn.datasets.core.data_collator.DefaultDataCollator
      batch_processors: null
    pin_memory: true
    drop_last: false
    timeout: 0.0
    worker_init_fn: null
    multiprocessing_context: null
    generator: null
    prefetch_factor: null
    persistent_workers: false
    pin_memory_device: ''
task: ???
trainer:
  _target_: lightning.pytorch.trainer.trainer.Trainer
  accelerator: auto
  strategy: auto
  devices: auto
  num_nodes: 1
  precision: null
  logger: null
  callbacks: null
  fast_dev_run: false
  max_epochs: null
  min_epochs: null
  max_steps: -1
  min_steps: null
  max_time: null
  limit_train_batches: null
  limit_val_batches: null
  limit_test_batches: null
  limit_predict_batches: null
  overfit_batches: 0.0
  val_check_interval: null
  check_val_every_n_epoch: 1
  num_sanity_val_steps: null
  log_every_n_steps: null
  enable_checkpointing: true
  enable_progress_bar: true
  enable_model_summary: true
  accumulate_grad_batches: 1
  gradient_clip_val: null
  gradient_clip_algorithm: null
  deterministic: null
  benchmark: null
  inference_mode: true
  use_distributed_sampler: true
  profiler: null
  detect_anomaly: false
  barebones: false
  plugins: null
  sync_batchnorm: false
  reload_dataloaders_every_n_epochs: 0
  default_root_dir: ${hydra:runtime.output_dir}/checkpoints
tags:
- ${experiment_name}
resume_from_checkpoint: null
strict_loading: true
torch_compile_kwargs:
  disable: true
  fullgraph: false
  dynamic: null
  backend: inductor
  mode: null
  options: null
The config keys with a value of ??? are placeholders that must be overridden in the experiment configurations. While the dataset key in the dataloader group is also a placeholder, it should not be provided as it will be automatically filled in from the datasets group.
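As a rough illustration of what "placeholder" means here, the sketch below walks a nested dictionary and lists the dotted paths whose value is still ???. The find_missing helper and the trimmed-down base dictionary are hypothetical and written only for this guide; in practice the configuration is handled by Hydra, which reports unset mandatory values itself.

```python
from typing import Any, Iterator

MISSING = "???"  # marker for a mandatory value that has not been set


def find_missing(cfg: dict, prefix: str = "") -> Iterator[str]:
    """Yield the dotted path of every key whose value is still the ??? marker."""
    for key, value in cfg.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested groups such as dataloader.train.
            yield from find_missing(value, path)
        elif value == MISSING:
            yield path


# A trimmed-down, hand-written subset of the base configuration shown above.
base: dict = {
    "experiment_name": "???",
    "job_type": "train",
    "dataloader": {"train": {"dataset": "???", "batch_size": 1}},
    "task": "???",
}

missing = list(find_missing(base))
print(missing)
```

In this subset, experiment_name and task are the entries an experiment configuration would be expected to supply, while dataloader.train.dataset is the one that is filled in automatically.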
Running an Experiment¶
To run an experiment locally, use the following command:
mmlearn_run 'hydra.searchpath=[pkg://path.to.my_project.configs]' \
    +experiment=my_experiment \
    experiment_name=my_experiment_name
Tip
You can see the full config for an experiment without running it by adding the --help flag to the command. The task override is required for the command to run.
mmlearn_run 'hydra.searchpath=[pkg://path.to.my_project.configs]' \
    +experiment=my_experiment \
    experiment_name=my_experiment_name \
    task=my_task \
    --help
To run the experiment on a SLURM cluster, use the following command:
mmlearn_run --multirun \
    hydra.launcher.mem_per_cpu=5G \
    hydra.launcher.qos=your_qos \
    hydra.launcher.partition=your_partition \
    hydra.launcher.gres=gpu:4 \
    hydra.launcher.cpus_per_task=8 \
    hydra.launcher.tasks_per_node=4 \
    hydra.launcher.nodes=1 \
    hydra.launcher.stderr_to_stdout=true \
    hydra.launcher.timeout_min=720 \
    'hydra.searchpath=[pkg://path.to.my_project.configs]' \
    +experiment=my_experiment \
    experiment_name=my_experiment_name
This uses Hydra’s submitit launcher plugin to submit the experiment to the SLURM scheduler with the specified resources.
Note
After the job is submitted, it is okay to cancel the program with Ctrl+C. The job will continue running on the cluster. You can also add & at the end of the command to run it in the background.