fl4health.model_bases.pca module
- class PcaModule(low_rank=False, full_svd=False, rank_estimation=6)
Bases: Module
- __init__(low_rank=False, full_svd=False, rank_estimation=6)
PyTorch module for performing Principal Component Analysis.
Notes:
If low_rank is set to True, then a value \(q\) for rank_estimation is required (either specified by the user or via its default value). If \(q\) is too far from the actual rank \(k\) of the data matrix, then the resulting rank-\(q\) SVD approximation is not guaranteed to be a good approximation of the data matrix.
If low_rank is set to True, then a value \(q\) for rank_estimation can be chosen according to the following criteria: in general, \(k \leq q \leq \min(2\cdot k, m, n)\), where \(m\) and \(n\) are the dimensions of the data matrix. For large low-rank matrices, take \(q = k + l\), where \(5 \leq l \leq 10\). If \(k\) is relatively small compared to \(\min(m, n)\), choosing \(l = 0, 1, 2\) may be sufficient.
If low_rank is set to True and rank_estimation is set to \(q\), then the module will utilize a randomized algorithm to compute a rank-\(q\) approximation of the data matrix via SVD.
For more details on this, see:
https://pytorch.org/docs/stable/generated/torch.svd_lowrank.html
and
https://pytorch.org/docs/stable/generated/torch.pca_lowrank.html
As per the official PyTorch documentation, in general, the user should set low_rank to False. Setting it to True would be useful for huge sparse matrices.
- Parameters:
low_rank (bool, optional) – Indicates whether the data matrix can be well-approximated by a low-rank singular value decomposition. If the user has good reasons to believe so, then this parameter can be set to True to allow for more efficient computations. Defaults to False.
full_svd (bool, optional) – Indicates whether full SVD or reduced SVD is performed. If low_rank is set to True, then an alternative implementation of SVD will be used and this argument is ignored. Defaults to False.
rank_estimation (int, optional) – A slight overestimation of the rank of the data matrix. Only used if self.low_rank is True. Defaults to 6.
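The selection criteria in the notes above can be sketched as a small helper. This is only an illustration: the function name `suggest_rank_estimation` and its `oversample` argument (the \(l\) in \(q = k + l\)) are hypothetical and not part of fl4health, and the sketch assumes the true rank \(k\) is roughly known in advance.

```python
def suggest_rank_estimation(k: int, m: int, n: int, oversample: int = 5) -> int:
    """Pick q = k + l, clamped so that k <= q <= min(2 * k, m, n).

    Hypothetical helper for illustration; not part of fl4health.
    k is the (estimated) rank of an m-by-n data matrix, oversample is l.
    """
    if not 1 <= k <= min(m, n):
        raise ValueError("k must satisfy 1 <= k <= min(m, n)")
    if oversample < 0:
        raise ValueError("oversample must be non-negative")
    upper = min(2 * k, m, n)
    return min(k + oversample, upper)


# Large low-rank matrix: k = 20 with l = 5.
q = suggest_rank_estimation(k=20, m=1000, n=500, oversample=5)
```

A value chosen this way could then be passed to the constructor as `PcaModule(low_rank=True, rank_estimation=q)`.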
- compute_projection_variance(X, k, center_data=False)
Compute the variance of the data matrix X after projection via PCA.
The variance is defined as || X @ U ||_F ** 2, where ||·||_F denotes the Frobenius norm.
- Parameters:
X (Tensor) – Input data tensor whose rows represent data points.
k (int | None) – The number of principal components onto which projection is applied.
center_data (bool, optional) – Indicates whether to subtract data mean prior to projecting the data into a lower-dimensional subspace, and whether to add the data mean after projecting back. Defaults to False.
- Returns:
Variance after projection as defined above.
- Return type:
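As a rough illustration of the variance defined above, the following NumPy sketch computes the squared Frobenius norm of the projected data. It is written in NumPy rather than PyTorch so that it runs on its own, and it is not the module's implementation; the function name `projection_variance` is hypothetical.

```python
import numpy as np


def projection_variance(X: np.ndarray, k: int) -> float:
    """Squared Frobenius norm of X projected onto its top-k principal components."""
    # Rows of Vt are the right singular vectors of X, so the columns of U
    # below are the top-k principal components.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    U = Vt[:k].T  # shape (d, k)
    return float(np.linalg.norm(X @ U, ord="fro") ** 2)


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)  # center the data before projecting

singular_values = np.linalg.svd(X, compute_uv=False)
variance = projection_variance(X, k=2)
```

Under these assumptions the result equals the sum of the top k squared singular values of X, which gives a quick sanity check.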
- compute_reconstruction_error(X, k, center_data=False)
Compute the reconstruction error of X under PCA reconstruction.
More precisely, if X is an N by d data matrix whose rows are the data points, and U is the matrix whose columns are the principal components of X, then the reconstruction loss is defined as 1 / N * | X @ U @ U.T - X | ** 2.
NOTE: The reconstruction (after centering) is X @ U @ U.T because this method assumes that the rows of X are the data points while the columns of U are the principal components.
- Parameters:
X (Tensor) – Input data tensor whose rows represent data points.
k (int | None) – The number of principal components onto which projection is applied.
center_data (bool, optional) – Indicates whether to subtract data mean prior to projecting the data into a lower-dimensional subspace, and whether to add the data mean after projecting back. Defaults to False.
- Returns:
Reconstruction loss as defined above.
- Return type:
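The reconstruction loss defined above can be sketched in NumPy as follows. This is a minimal sketch, not the module's code: the function name `reconstruction_error` is hypothetical, and the norm is assumed to be the Frobenius norm.

```python
import numpy as np


def reconstruction_error(X: np.ndarray, k: int) -> float:
    """1 / N * ||X @ U @ U.T - X||_F ** 2 using the top-k principal components."""
    N = X.shape[0]
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    U = Vt[:k].T  # columns are the top-k principal components
    # Assumption: the norm in the definition is the Frobenius norm.
    return float(np.linalg.norm(X @ U @ U.T - X, ord="fro") ** 2 / N)


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)  # rows are data points; center them

error_k2 = reconstruction_error(X, k=2)    # keeps 2 of 4 components
error_full = reconstruction_error(X, k=4)  # keeps every component
```

With every component kept, U @ U.T is the identity and the error vanishes up to floating-point noise; with fewer components, the error is the sum of the discarded squared singular values divided by N.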
- forward(X, center_data)
Perform PCA on the data matrix X by computing its SVD.
NOTE: the algorithm assumes that the rows of X are the data points (after reshaping as needed). Consequently, the principal components, which are the eigenvectors of X.T @ X, are the right singular vectors in the SVD of X.
- Parameters:
X (Tensor) – Data matrix.
center_data (bool) – If true, then the data mean will be subtracted from all data points prior to performing PCA. If center_data is false, it is expected that the data has already been centered and an exception will be thrown if it is not.
- Returns:
The principal components (i.e., right singular vectors) and their corresponding singular values.
- Return type:
tuple[Tensor, Tensor]
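The relationship stated in the note above, that the principal components are both the right singular vectors of X and the eigenvectors of X.T @ X, can be checked numerically. The sketch below uses NumPy on random data purely for illustration; the variable names are not taken from the module.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
X = X - X.mean(axis=0)  # rows are data points; center them

# Reduced SVD: X = W @ diag(S) @ Vt. The rows of Vt are the right singular vectors.
_, S, Vt = np.linalg.svd(X, full_matrices=False)
principal_components = Vt.T  # columns are the principal components

# Each principal component v with singular value s satisfies
# (X.T @ X) @ v = s ** 2 * v, so this residual should be numerically zero.
gram = X.T @ X
residual = gram @ principal_components - principal_components * S**2
```

The eigenvalues of X.T @ X are therefore the squared singular values, which is why the singular values returned alongside the components rank them by explained variance.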
- maybe_reshape(X)
Reshape input tensor X as needed so SVD can be computed. Reshaping is required when each data point is an N-dimensional tensor because PCA requires X to be a 2D data matrix.
- Parameters:
X (Tensor) – Data matrix
- Returns:
Tensor flattened to be 2D.
- Return type:
Tensor
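The reshaping described above might look like the following NumPy sketch. It assumes that the first axis indexes data points; the helper name `flatten_to_2d` is hypothetical, and the module's actual reshaping logic may differ.

```python
import numpy as np


def flatten_to_2d(X: np.ndarray) -> np.ndarray:
    """Flatten every data point into a row vector (hypothetical helper).

    Assumes the first axis indexes data points.
    """
    return X.reshape(X.shape[0], -1)


images = np.zeros((8, 3, 4, 4))  # 8 data points, each a 3 x 4 x 4 tensor
flat = flatten_to_2d(images)     # 8 rows of 3 * 4 * 4 = 48 features
```

In PyTorch, a similar effect could be obtained with `torch.flatten(X, start_dim=1)`; whether the module does exactly this is an assumption.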
- prepare_data_forward(X, center_data)
Prepare input data X for PCA by reshaping and centering it as needed.
- Parameters:
X (Tensor) – Data matrix.
center_data (bool) – If true, then the data mean will be subtracted from all data points prior to performing PCA. If center_data is false, it is expected that the data has already been centered and an exception will be thrown if it is not.
- Returns:
Prepared data matrix.
- Return type:
Tensor
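The behavior described for center_data can be sketched as below. This is an illustrative NumPy version under stated assumptions, not the module's code: the function name `prepare_data`, the `atol` tolerance, and the choice of `ValueError` are all hypothetical.

```python
import numpy as np


def prepare_data(X: np.ndarray, center_data: bool, atol: float = 1e-6) -> np.ndarray:
    """Reshape to 2D, then center the data or verify that it is centered."""
    X = X.reshape(X.shape[0], -1)  # assumes the first axis indexes data points
    mean = X.mean(axis=0)
    if center_data:
        return X - mean
    # atol is an assumed tolerance; the module's actual check may differ.
    if not np.allclose(mean, 0.0, atol=atol):
        raise ValueError("Data is not centered; pass center_data=True.")
    return X


raw = np.array([[1.0, 2.0], [3.0, 6.0]])
centered = prepare_data(raw, center_data=True)
```

Calling `prepare_data(raw, center_data=False)` raises in this sketch because the column means of `raw` are nonzero, mirroring the exception described above.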
- project_back(X_lower_dim, add_mean=False)
Project low-dimensional principal representations back into the original space to recover the reconstruction of data points.
- Parameters:
X_lower_dim (Tensor) – Matrix whose rows are low-dimensional principal representations of the original data.
add_mean (bool, optional) – Indicates whether the training data mean should be added to the projection result. This can be set to True if the user centered the data prior to dimensionality reduction and now wishes to add back the data mean. Defaults to False.
- Returns:
Reconstruction of data points.
- Return type:
Tensor
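Projecting back amounts to multiplying the low-dimensional representations by the transpose of the principal-component matrix and, optionally, adding the mean. The NumPy sketch below illustrates this on random data; it is not the module's implementation, and the variable names are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, size=(40, 3))
mean = X.mean(axis=0)
X_centered = X - mean

_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
U_k = Vt[:2].T  # top-2 principal components as columns, shape (3, 2)

X_lower_dim = X_centered @ U_k               # low-dimensional rows, shape (40, 2)
reconstruction = X_lower_dim @ U_k.T + mean  # back in the original space, shape (40, 3)
```

Because the columns of `U_k` are orthonormal, projecting the reconstruction again returns the same low-dimensional representations.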
- project_lower_dim(X, k=None, center_data=False)
Project input data X onto the top k principal components.
NOTE: The result of projection (after centering) is X @ U because this method assumes that the rows of X are the data points while the columns of U are the principal components.
- Parameters:
X (Tensor) – Input data matrix whose rows are the data points.
k (int | None, optional) – The number of principal components onto which projection is done. If k is None, then all principal components will be used in the projection. Defaults to None.
center_data (bool, optional) – If true, then the training data mean (learned in the forward pass) will be subtracted from all data points prior to projection. If center_data is false, it is expected that the data has already been centered in this manner by the user. Defaults to False.
- Returns:
Projection result.
- Return type:
Tensor
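The projection described above, including the use of the training mean on new data, can be sketched in NumPy as follows. This is a minimal illustration under the assumption that the components and mean were learned from a separate training matrix; the helper name `project` is hypothetical and is not the module's API.

```python
from typing import Optional

import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=2.0, size=(60, 5))
train_mean = X_train.mean(axis=0)

# "Forward pass": learn principal components from the centered training data.
_, _, Vt = np.linalg.svd(X_train - train_mean, full_matrices=False)
U = Vt.T  # columns are the principal components


def project(X: np.ndarray, k: Optional[int] = None) -> np.ndarray:
    """Subtract the training mean, then project onto the top-k components."""
    U_k = U if k is None else U[:, :k]  # k=None keeps every component
    return (X - train_mean) @ U_k


X_new = rng.normal(loc=2.0, size=(10, 5))
Z_all = project(X_new)        # shape (10, 5)
Z_top2 = project(X_new, k=2)  # shape (10, 2)
```

The first two columns of `Z_all` coincide with `Z_top2`, since choosing k only truncates the component matrix.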