mmlearn.datasets.core.samplers
Samplers for data loading.
Classes

CombinedDatasetRatioSampler – Sampler for weighted sampling from a CombinedDataset.
DistributedEvalSampler – Sampler for distributed evaluation.
- class CombinedDatasetRatioSampler(dataset, ratios=None, num_samples=None, replacement=False, shuffle=True, rank=None, num_replicas=None, drop_last=False, seed=0)[source]
Sampler for weighted sampling from a CombinedDataset.
- Parameters:
dataset (CombinedDataset) – An instance of CombinedDataset to sample from.
ratios (Optional[Sequence[float]], optional, default=None) – A sequence of ratios for sampling from each dataset in the combined dataset. The length of the sequence must be equal to the number of datasets in the combined dataset (dataset). If None, the length of each dataset in the combined dataset is used as the ratio. The ratios are normalized to sum to 1 (see the worked example after this parameter list).
num_samples (Optional[int], optional, default=None) – The number of samples to draw from the combined dataset. If None, the sampler will draw as many samples as there are in the combined dataset. This number must yield at least one sample per dataset in the combined dataset, when multiplied by the corresponding ratio.
replacement (bool, default=False) – Whether to sample with replacement or not.
shuffle (bool, default=True) – Whether to shuffle the sampled indices or not. If False, the indices of each dataset will appear in the order they are stored in the combined dataset. This is similar to sequential sampling from each dataset. The datasets that make up the combined dataset are still sampled randomly.
rank (Optional[int], optional, default=None) – Rank of the current process within num_replicas. By default, rank is retrieved from the current distributed group.
num_replicas (Optional[int], optional, default=None) – Number of processes participating in distributed training. By default, num_replicas is retrieved from the current distributed group.
drop_last (bool, default=False) – Whether to drop the last incomplete batch or not. If True, the sampler will drop samples to make the number of samples evenly divisible by the number of replicas in distributed mode.
seed (int, default=0) – Random seed used when sampling from the combined dataset and shuffling the sampled indices.
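As a worked illustration of how ratios and num_samples interact (the numbers below are hypothetical and only mirror the normalization described above, not the library's internal code):

>>> lengths = [100, 300]  # sizes of the datasets inside the CombinedDataset
>>> ratios = [n / sum(lengths) for n in lengths]  # default behaviour: lengths normalized to sum to 1
>>> ratios
[0.25, 0.75]
>>> num_samples = 8
>>> [num_samples * r for r in ratios]  # each entry must be >= 1 for num_samples to be valid
[2.0, 6.0]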
- dataset
The dataset to sample from.
- Type: CombinedDataset
- probs
The probabilities for sampling from each dataset in the combined dataset. This is computed from the ratios argument and is normalized to sum to 1.
- Type:
- rank
Rank of the current process within num_replicas.
- Type: int
- drop_last
Whether to drop samples to make the number of samples evenly divisible by the number of replicas in distributed mode.
- Type: bool
- seed
Random seed used when sampling from the combined dataset and shuffling the sampled indices.
- Type: int
- epoch
Current epoch number, used to set the random seed. This is useful in distributed mode to ensure that each process receives a different random ordering of the samples (see the usage sketch below).
- Type: int
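A minimal usage sketch for this sampler. The dataset names and the import path for CombinedDataset are assumptions, and the set_epoch call assumes the sampler follows the same per-epoch convention as torch.utils.data.DistributedSampler (suggested by the epoch attribute above):

>>> from torch.utils.data import DataLoader
>>> from mmlearn.datasets.core import CombinedDataset  # assumed import path
>>> combined = CombinedDataset([image_dataset, text_dataset])  # placeholder datasets
>>> sampler = CombinedDatasetRatioSampler(combined, ratios=[0.5, 0.5], shuffle=True, seed=0)
>>> loader = DataLoader(combined, batch_size=32, sampler=sampler)
>>> for epoch in range(num_epochs):
...     sampler.set_epoch(epoch)  # assumed method; re-seeds shuffling each epoch
...     for batch in loader:
...         train_step(batch)  # placeholder training step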
- class DistributedEvalSampler(dataset, num_replicas=None, rank=None, shuffle=False, seed=0)[source]
Sampler for distributed evaluation.
The main differences between this and torch.utils.data.DistributedSampler are that this sampler does not add extra samples to make the number of samples evenly divisible, and that shuffling is disabled by default.
- Parameters:
dataset (torch.utils.data.Dataset) – Dataset used for sampling.
num_replicas (Optional[int], optional, default=None) – Number of processes participating in distributed training. By default, num_replicas is retrieved from the current distributed group.
rank (Optional[int], optional, default=None) – Rank of the current process within num_replicas. By default, rank is retrieved from the current distributed group.
shuffle (bool, optional, default=False) – If True, the sampler will shuffle the indices. Disabled by default.
seed (int, optional, default=0) – Random seed used to shuffle the sampler if shuffle=True. This number should be identical across all processes in the distributed group.
Warning
DistributedEvalSampler should NOT be used for training; the distributed processes could hang forever. See [1] for details.
Notes
This sampler is for evaluation purposes, where synchronization does not happen every epoch. Synchronization should be done outside the dataloader loop (a sketch follows the example below). It is especially useful in conjunction with torch.nn.parallel.DistributedDataParallel [2].
The input Dataset is assumed to be of constant size.
This implementation is adapted from [3].
References
Examples
>>> def example():
...     start_epoch, n_epochs = 0, 2
...     sampler = DistributedEvalSampler(dataset) if is_distributed else None
...     loader = DataLoader(dataset, shuffle=(sampler is None), sampler=sampler)
...     for epoch in range(start_epoch, n_epochs):
...         if is_distributed:
...             sampler.set_epoch(epoch)
...         evaluate(loader)
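Continuing the example, and per the note that synchronization should happen outside the dataloader loop, the evaluation itself might aggregate metrics once after iterating. This is a sketch only: the body of evaluate, model, and device are placeholders, while dist.all_reduce is the standard torch.distributed collective (default op is SUM).

>>> import torch
>>> import torch.distributed as dist
>>> def evaluate(loader):
...     correct = torch.zeros(1, device=device)
...     total = torch.zeros(1, device=device)
...     for inputs, targets in loader:  # no collective calls inside the loop
...         preds = model(inputs.to(device)).argmax(dim=-1)
...         correct += (preds == targets.to(device)).sum()
...         total += targets.numel()
...     dist.all_reduce(correct)  # sum partial counts across processes, once per evaluation
...     dist.all_reduce(total)
...     return (correct / total).item()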