mmlearn.datasets.core.samplers

Samplers for data loading.

Classes

CombinedDatasetRatioSampler

Sampler for weighted sampling from a CombinedDataset.

DistributedEvalSampler

Sampler for distributed evaluation.

class CombinedDatasetRatioSampler(dataset, ratios=None, num_samples=None, replacement=False, shuffle=True, rank=None, num_replicas=None, drop_last=False, seed=0)[source]

Sampler for weighted sampling from a CombinedDataset.

Parameters:
  • dataset (CombinedDataset) – An instance of CombinedDataset to sample from.

  • ratios (Optional[Sequence[float]], optional, default=None) – A sequence of ratios for sampling from each dataset in the combined dataset. The length of the sequence must be equal to the number of datasets in the combined dataset (dataset). If None, the length of each dataset in the combined dataset is used as the ratio. The ratios are normalized to sum to 1.

  • num_samples (Optional[int], optional, default=None) – The number of samples to draw from the combined dataset. If None, the sampler will draw as many samples as there are in the combined dataset. This number must yield at least one sample per dataset in the combined dataset, when multiplied by the corresponding ratio.

  • replacement (bool, default=False) – Whether to sample with replacement or not.

  • shuffle (bool, default=True) – Whether to shuffle the sampled indices or not. If False, the indices of each dataset appear in the order they are stored in the combined dataset, similar to sequential sampling from each dataset; which dataset each sample is drawn from is still chosen randomly.

  • rank (Optional[int], optional, default=None) – Rank of the current process within num_replicas. By default, rank is retrieved from the current distributed group.

  • num_replicas (Optional[int], optional, default=None) – Number of processes participating in distributed training. By default, num_replicas is retrieved from the current distributed group.

  • drop_last (bool, default=False) – Whether to drop the last incomplete batch or not. If True, the sampler will drop samples to make the number of samples evenly divisible by the number of replicas in distributed mode.

  • seed (int, default=0) – Random seed used when sampling from the combined dataset and shuffling the sampled indices.
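
A minimal construction sketch (hedged: the toy TensorDataset stand-ins, their sizes, and the batch size are hypothetical; the import path for CombinedDataset is assumed from the module layout shown here, and CombinedDataset is assumed to accept a sequence of map-style datasets):

>>> import torch
>>> from torch.utils.data import DataLoader, TensorDataset
>>> from mmlearn.datasets.core import CombinedDataset
>>> from mmlearn.datasets.core.samplers import CombinedDatasetRatioSampler
>>> ds_a = TensorDataset(torch.arange(100).unsqueeze(1))  # hypothetical dataset, 100 samples
>>> ds_b = TensorDataset(torch.arange(50).unsqueeze(1))  # hypothetical dataset, 50 samples
>>> combined = CombinedDataset([ds_a, ds_b])
>>> # draw roughly 70% of samples from ds_a and 30% from ds_b; the ratios
>>> # are normalized to sum to 1, so [7, 3] would behave identically
>>> sampler = CombinedDatasetRatioSampler(combined, ratios=[0.7, 0.3])
>>> loader = DataLoader(combined, batch_size=8, sampler=sampler)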

Attributes:
  • dataset (CombinedDataset) – The dataset to sample from.

  • num_samples (int) – The number of samples to draw from the combined dataset.

  • probs (torch.Tensor) – The probabilities for sampling from each dataset in the combined dataset, computed from the ratios argument and normalized to sum to 1.

  • replacement (bool) – Whether to sample with replacement or not.

  • shuffle (bool) – Whether to shuffle the sampled indices or not.

  • rank (int) – Rank of the current process within num_replicas.

  • num_replicas (int) – Number of processes participating in distributed training.

  • drop_last (bool) – Whether to drop samples to make the number of samples evenly divisible by the number of replicas in distributed mode.

  • seed (int) – Random seed used when sampling from the combined dataset and shuffling the sampled indices.

  • epoch (int) – Current epoch number, used to set the random seed. In distributed mode, this ensures each process receives a different random ordering of the samples.

  • total_size (int) – The total number of samples across all processes.

__iter__()[source]

Return an iterator that yields sample indices for the combined dataset.

Return type:

Iterator[int]

__len__()[source]

Return the total number of samples in the sampler.

Return type:

int

property num_samples: int

Return the number of samples managed by the sampler.

set_epoch(epoch)[source]

Set the epoch for this sampler.

When shuffle=True, this ensures all replicas use a different random ordering for each epoch. Otherwise, the next iteration of this sampler will yield the same ordering.

Parameters:

epoch (int) – Epoch number.

Return type:

None
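
A minimal per-epoch sketch (sampler and loader as in the construction sketch above; train_one_epoch is a hypothetical placeholder for one pass over the loader):

>>> for epoch in range(2):
...     sampler.set_epoch(epoch)  # re-seed sampling and shuffling for this epoch
...     train_one_epoch(loader)  # hypothetical: one training pass over the loader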

property total_size: int

Return the total size of the dataset.

class DistributedEvalSampler(dataset, num_replicas=None, rank=None, shuffle=False, seed=0)[source]

Sampler for distributed evaluation.

The main differences between this and torch.utils.data.DistributedSampler are that this sampler does not add extra samples to make it evenly divisible and shuffling is disabled by default.
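
A small arithmetic sketch of what this means for per-rank sample counts (the strided split below is illustrative only, not a claim about the exact internal partitioning):

>>> dataset_len, world_size = 10, 3
>>> -(-dataset_len // world_size)  # DistributedSampler pads every rank to ceil(10 / 3)
4
>>> # DistributedEvalSampler leaves the tail uneven instead of duplicating samples
>>> [len(range(rank, dataset_len, world_size)) for rank in range(world_size)]
[4, 3, 3]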

Parameters:
  • dataset (torch.utils.data.Dataset) – Dataset used for sampling.

  • num_replicas (Optional[int], optional, default=None) – Number of processes participating in distributed training. By default, num_replicas is retrieved from the current distributed group.

  • rank (Optional[int], optional, default=None) – Rank of the current process within num_replicas. By default, rank is retrieved from the current distributed group.

  • shuffle (bool, optional, default=False) – If True, the sampler will shuffle the indices.

  • seed (int, optional, default=0) – Random seed used to shuffle the sampler if shuffle=True. This number should be identical across all processes in the distributed group.

Warning

DistributedEvalSampler should NOT be used for training. The distributed processes could hang forever. See [1] for details.

Notes

  • This sampler is for evaluation purposes, where synchronization does not happen every epoch. Synchronization should be done outside the dataloader loop. It is especially useful in conjunction with torch.nn.parallel.DistributedDataParallel [2].

  • The input Dataset is assumed to be of constant size.

  • This implementation is adapted from [3].

References

Examples

>>> from torch.utils.data import DataLoader
>>> def example():
...     # `dataset`, `is_distributed` and `evaluate` are placeholders supplied
...     # by the surrounding evaluation script
...     start_epoch, n_epochs = 0, 2
...     sampler = DistributedEvalSampler(dataset) if is_distributed else None
...     loader = DataLoader(dataset, shuffle=(sampler is None), sampler=sampler)
...     for epoch in range(start_epoch, n_epochs):
...         if is_distributed:
...             sampler.set_epoch(epoch)
...         evaluate(loader)
__iter__()[source]

Return an iterator that iterates over the indices of the dataset.

Return type:

Iterator[int]

__len__()[source]

Return the number of samples.

Return type:

int

property num_samples: int

Return the number of samples managed by the sampler.

set_epoch(epoch)[source]

Set the epoch for this sampler.

When shuffle=True, this ensures all replicas use a different random ordering for each epoch. Otherwise, the next iteration of this sampler will yield the same ordering.

Parameters:

epoch (int) – Epoch number.

Return type:

None

property total_size: int

Return the total size of the dataset.