mmlearn.datasets.core.samplers.CombinedDatasetRatioSampler

class CombinedDatasetRatioSampler(dataset, ratios=None, num_samples=None, replacement=False, shuffle=True, rank=None, num_replicas=None, drop_last=False, seed=0)[source]

Bases: Sampler[int]

Sampler for weighted sampling from a CombinedDataset.

Parameters:
  • dataset (CombinedDataset) – An instance of CombinedDataset to sample from.

  • ratios (Optional[Sequence[float]], optional, default=None) – A sequence of ratios for sampling from each dataset in the combined dataset. The length of the sequence must be equal to the number of datasets in the combined dataset (dataset). If None, the length of each dataset in the combined dataset is used as the ratio. The ratios are normalized to sum to 1.

  • num_samples (Optional[int], optional, default=None) – The number of samples to draw from the combined dataset. If None, the sampler will draw as many samples as there are in the combined dataset. This number must yield at least one sample per dataset in the combined dataset, when multiplied by the corresponding ratio.

  • replacement (bool, default=False) – Whether to sample with replacement or not.

  • shuffle (bool, default=True) – Whether to shuffle the sampled indices or not. If False, the indices of each dataset will appear in the order they are stored in the combined dataset. This is similar to sequential sampling from each dataset. The datasets that make up the combined dataset are still sampled randomly.

  • rank (Optional[int], optional, default=None) – Rank of the current process within num_replicas. By default, rank is retrieved from the current distributed group.

  • num_replicas (Optional[int], optional, default=None) – Number of processes participating in distributed training. By default, num_replicas is retrieved from the current distributed group.

  • drop_last (bool, default=False) – Whether to drop the last incomplete batch or not. If True, the sampler will drop samples to make the number of samples evenly divisible by the number of replicas in distributed mode.

  • seed (int, default=0) – Random seed used when sampling from the combined dataset and when shuffling the sampled indices.
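The ratio handling documented above can be sketched in plain Python. This is a hedged illustration of the stated defaults (length-based ratios when ratios=None, normalization to sum to 1, and the at-least-one-sample-per-dataset constraint on num_samples), not the sampler's actual implementation; the dataset sizes are made up.

```python
# Hypothetical sizes for three datasets in a CombinedDataset.
dataset_sizes = [1000, 500, 250]

# ratios=None defaults to the length of each constituent dataset.
ratios = [float(s) for s in dataset_sizes]

# The ratios are normalized to sum to 1 (exposed as the `probs` attribute).
total = sum(ratios)
probs = [r / total for r in ratios]
print(probs)

# num_samples must yield at least one sample per dataset when
# multiplied by the corresponding normalized ratio.
num_samples = 700
assert all(num_samples * p >= 1 for p in probs)
```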

dataset

The dataset to sample from.

Type:

CombinedDataset

num_samples

The number of samples to draw from the combined dataset.

Type:

int

probs

The probabilities for sampling from each dataset in the combined dataset. This is computed from the ratios argument and is normalized to sum to 1.

Type:

torch.Tensor

replacement

Whether to sample with replacement or not.

Type:

bool

shuffle

Whether to shuffle the sampled indices or not.

Type:

bool

rank

Rank of the current process within num_replicas.

Type:

int

num_replicas

Number of processes participating in distributed training.

Type:

int

drop_last

Whether to drop samples to make the number of samples evenly divisible by the number of replicas in distributed mode.

Type:

bool

seed

Random seed used when sampling from the combined dataset and when shuffling the sampled indices.

Type:

int

epoch

Current epoch number. This is used when seeding the random generator and, in distributed mode, ensures that each process receives a different random ordering of the samples each epoch.

Type:

int

total_size

The total number of samples across all processes.

Type:

int
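The interaction between drop_last, num_replicas, and total_size can be sketched as follows. This assumes the sampler follows the same arithmetic as torch.utils.data.DistributedSampler (truncate to a multiple of num_replicas when drop_last=True, otherwise pad up by reusing samples); the descriptions above suggest this but do not spell it out, so treat it as an illustration.

```python
import math

def compute_total_size(num_samples: int, num_replicas: int, drop_last: bool) -> int:
    """Sketch of the assumed total-size arithmetic in distributed mode."""
    if drop_last:
        # Drop trailing samples so every replica gets the same count.
        per_replica = num_samples // num_replicas
    else:
        # Pad (by reusing samples) so every replica gets the same count.
        per_replica = math.ceil(num_samples / num_replicas)
    return per_replica * num_replicas

print(compute_total_size(10, 4, drop_last=True))   # → 8
print(compute_total_size(10, 4, drop_last=False))  # → 12
```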


__iter__()[source]

Return an iterator that yields sample indices for the combined dataset.

Return type:

Iterator[int]

property num_samples: int

Return the number of samples managed by the sampler.

set_epoch(epoch)[source]

Set the epoch for this sampler.

When shuffle=True, this ensures all replicas use a different random ordering for each epoch. Otherwise, the next iteration of this sampler will yield the same ordering.

Parameters:

epoch (int) – Epoch number.

Return type:

None
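The per-epoch reseeding that set_epoch enables can be illustrated with a small stand-in. The helper below is hypothetical (the sampler presumably uses a torch generator internally); it only shows why combining a fixed seed with the epoch number yields an ordering that is identical across replicas for a given epoch yet changes between epochs:

```python
import random

def epoch_permutation(n: int, seed: int, epoch: int) -> list:
    """Hypothetical helper: deterministic permutation for a (seed, epoch) pair."""
    rng = random.Random(seed + epoch)  # reseed with seed + epoch
    indices = list(range(n))
    rng.shuffle(indices)
    return indices

# Same seed and epoch -> identical ordering on every replica.
assert epoch_permutation(16, seed=0, epoch=3) == epoch_permutation(16, seed=0, epoch=3)

# Advancing the epoch (via set_epoch) changes the ordering.
print(epoch_permutation(8, seed=0, epoch=0))
print(epoch_permutation(8, seed=0, epoch=1))
```

In a typical training loop this corresponds to calling sampler.set_epoch(epoch) at the start of each epoch before iterating over the dataloader.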

property total_size: int

Return the total number of samples across all processes.