fl4health.utils.sampler module
- class DirichletLabelBasedSampler(unique_labels, hash_key=None, sample_percentage=0.5, beta=100)
Bases: LabelBasedSampler
- __init__(unique_labels, hash_key=None, sample_percentage=0.5, beta=100)
Class used to subsample a dataset so that the classes of samples are distributed in a non-IID way. In particular, the DirichletLabelBasedSampler uses a Dirichlet distribution to determine the number of samples drawn from each class. The sampler is constructed by passing a beta parameter, which determines the level of heterogeneity, and a sample_percentage, which determines the size of the subsampled dataset relative to the original. To subsample a dataset, call the subsample method and pass it a BaseDataset object. The method returns the resulting subsampled dataset.
NOTE: The range for beta is (0, infinity). The larger the value of beta, the more even the multinomial probabilities of the labels will be. The smaller beta is, the more heterogeneous the probabilities are. For example:
np.random.dirichlet([1]*5): array([0.23645891, 0.08857052, 0.29519184, 0.2999956, 0.07978313])
np.random.dirichlet([1000]*5): array([0.2066252, 0.19644968, 0.20080513, 0.19992536, 0.19619462])
- Parameters:
unique_labels (list[Any]) – The full set of labels contained in the dataset.
sample_percentage (float, optional) – The fraction of the entire dataset to keep after downsampling. For example, if this value is 0.5 and the dataset is of size 100, we will end up with 50 total data points. Defaults to 0.5.
beta (float, optional) – This controls the heterogeneity of the label sampling. The smaller the beta, the more skewed the label assignments will be for the dataset. Defaults to 100.
hash_key (int | None, optional) – Seed for the random number generators and samplers. Defaults to None.
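The effect of beta and sample_percentage on per-class counts can be illustrated with a minimal sketch. This is not the library's implementation: the helper names dirichlet_sample and per_class_counts are hypothetical, and the Dirichlet draw is done with the standard library (normalized Gamma variates) rather than np.random.dirichlet so the example has no dependencies.

```python
import random


def dirichlet_sample(alpha, rng):
    """Draw one Dirichlet(alpha) sample by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def per_class_counts(n_total, n_classes, sample_percentage, beta, seed=None):
    """Hypothetical helper: turn beta and sample_percentage into class counts.

    Assumes a symmetric Dirichlet with concentration beta for every class.
    """
    rng = random.Random(seed)
    probs = dirichlet_sample([beta] * n_classes, rng)
    # Total number of points kept after subsampling.
    target = int(n_total * sample_percentage)
    # Rounding down means the counts may sum to slightly less than target.
    counts = [int(p * target) for p in probs]
    return probs, counts


# Large beta: class probabilities are close to uniform (about 0.2 each).
probs_even, counts_even = per_class_counts(100, 5, 0.5, beta=1000, seed=0)
# Small beta: most of the probability mass lands on one or two classes.
probs_skew, counts_skew = per_class_counts(100, 5, 0.5, beta=0.1, seed=0)

print(probs_even, counts_even)
print(probs_skew, counts_skew)
```

With beta=1000 the five probabilities stay close to 0.2, while with beta=0.1 a few classes dominate. This is the heterogeneity that beta is described as controlling.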
- class LabelBasedSampler(unique_labels)
Bases: ABC
- __init__(unique_labels)
This is an abstract class to be extended to create dataset samplers based on the class of samples.
- Parameters:
unique_labels (list[Any]) – The full set of labels contained in the dataset.
- abstract subsample(dataset)
- Return type:
TypeVar(D, bound=TensorDataset | DictionaryDataset)
- class MinorityLabelBasedSampler(unique_labels, downsampling_ratio, minority_labels)
Bases: LabelBasedSampler
- __init__(unique_labels, downsampling_ratio, minority_labels)
This class is used to subsample a dataset so the classes are distributed in a non-IID way. In particular, the MinorityLabelBasedSampler explicitly downsamples classes based on the downsampling_ratio and minority_labels args used to construct the object. Subsampling a dataset is accomplished by calling the subsample method and passing a BaseDataset object. This will return the resulting subsampled dataset.
- Parameters:
unique_labels (list[T]) – The full set of labels contained in the dataset.
downsampling_ratio (float) – The fraction to which the specified “minority” labels are downsampled. For example, if a label L has 10 examples and the downsampling_ratio is 0.2, then 8 of the data points with label L are discarded, leaving 2.
minority_labels (Set[T]) – The labels subject to downsampling.
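The downsampling rule described by these parameters can be sketched in plain Python. This is an illustration only, not the library's code: the function downsample_minority is hypothetical, it operates on a flat list of labels rather than a dataset object, and it assumes the number kept per minority label is rounded down.

```python
import random


def downsample_minority(labels, minority_labels, downsampling_ratio, seed=None):
    """Hypothetical helper: return sorted indices kept after downsampling.

    Labels in minority_labels keep only a downsampling_ratio fraction of
    their examples; all other labels are kept in full.
    """
    rng = random.Random(seed)
    # Group example indices by label.
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)

    kept = []
    for label, indices in by_label.items():
        if label in minority_labels:
            # Assumption: round down the number of examples to keep.
            n_keep = int(len(indices) * downsampling_ratio)
            indices = rng.sample(indices, n_keep)
        kept.extend(indices)
    return sorted(kept)


# Label "a" has 10 examples and is downsampled to 2; "b" is untouched.
labels = ["a"] * 10 + ["b"] * 10
kept = downsample_minority(labels, {"a"}, 0.2, seed=0)
kept_labels = [labels[i] for i in kept]
print(kept_labels.count("a"), kept_labels.count("b"))
```

This matches the example in the downsampling_ratio description: of the 10 points with the minority label, 8 are discarded and 2 remain.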