fl4health.model_bases.pca module

class PcaModule(low_rank=False, full_svd=False, rank_estimation=6)[source]

Bases: Module

__init__(low_rank=False, full_svd=False, rank_estimation=6)[source]

PyTorch module for performing Principal Component Analysis.

Notes:

  • If low_rank is set to True, then a value \(q\) for rank_estimation is required (either specified by the user or via its default value). If \(q\) is too far away from the actual rank \(k\) of the data matrix, then the resulting rank-\(q\) SVD approximation is not guaranteed to be a good approximation of the data matrix.

  • If low_rank is set to True, then a value \(q\) for rank_estimation can be chosen according to the following criteria:

    • In general, \(k \leq q \leq \min(2 \cdot k, m, n)\), where \(m \times n\) is the shape of the data matrix. For large low-rank matrices, take \(q = k + l\), where \(5 \leq l \leq 10\). If \(k\) is relatively small compared to \(\min(m, n)\), choosing \(l = 0, 1, 2\) may be sufficient.

  • If low_rank is set to True and rank_estimation is set to \(q\), then the module will utilize a randomized algorithm to compute a rank-q approximation of the data matrix via SVD.

For more details on this, see:

https://pytorch.org/docs/stable/generated/torch.svd_lowrank.html

and

https://pytorch.org/docs/stable/generated/torch.pca_lowrank.html

As per the official PyTorch documentation, the user should in general set low_rank to False; setting it to True is mainly useful for huge sparse matrices.

Parameters:
  • low_rank (bool, optional) – Indicates whether the data matrix can be well-approximated by a low-rank singular value decomposition. If the user has good reasons to believe so, then this parameter can be set to True to allow for more efficient computations. Defaults to False.

  • full_svd (bool, optional) – Indicates whether full SVD or reduced SVD is performed. If low_rank is set to True, then an alternative implementation of SVD will be used and this argument is ignored. Defaults to False.

  • rank_estimation (int, optional) – A slight overestimation of the rank of the data matrix. Only used if self.low_rank is True. Defaults to 6.
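
For instance, here is a rough instantiation sketch following the rank_estimation heuristic above (the import path is taken from this module; the matrix shape and rank are purely illustrative):

    import torch

    from fl4health.model_bases.pca import PcaModule

    # Illustrative data matrix: 10,000 samples, 512 features, believed to be
    # approximately rank k = 20.
    X = torch.randn(10_000, 512)

    # Default: full/reduced SVD, the generally recommended setting.
    pca = PcaModule()

    # Low-rank variant: choose q = k + l with 5 <= l <= 10, e.g. q = 25,
    # which satisfies k <= q <= min(2 * k, m, n) = min(40, 10_000, 512).
    pca_low_rank = PcaModule(low_rank=True, rank_estimation=25)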

center_data(X)[source]
Return type:

Tensor

compute_cumulative_explained_variance()[source]
Return type:

float

compute_explained_variance_ratios()[source]
Return type:

Tensor

compute_projection_variance(X, k, center_data=False)[source]

Compute the variance of the data matrix X after projection via PCA.

The variance is defined as \(\| X U \|_F^2\), where the columns of \(U\) are the principal components of \(X\).

Parameters:
  • X (Tensor) – input data tensor whose rows represent data points.

  • k (int | None) – the number of principal components onto which projection is applied.

  • center_data (bool, optional) – Indicates whether to subtract data mean prior to projecting the data into a lower-dimensional subspace, and whether to add the data mean after projecting back. Defaults to False.

Returns:

Variance after projection as defined above.

Return type:

float
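
A minimal usage sketch (this assumes the module has already been fit via forward, documented below, and that, as with project_lower_dim, passing k=None uses all principal components):

    import torch

    from fl4health.model_bases.pca import PcaModule

    X_train = torch.randn(1000, 64)
    X_val = torch.randn(200, 64)

    pca = PcaModule()
    pca(X_train, center_data=True)  # fit: computes the principal components

    # Variance retained by the top 10 components versus all components.
    # k=None is assumed to use all principal components (as in project_lower_dim).
    retained = pca.compute_projection_variance(X_val, k=10, center_data=True)
    total = pca.compute_projection_variance(X_val, k=None, center_data=True)
    print(f"Fraction of variance retained: {retained / total:.3f}")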

compute_reconstruction_error(X, k, center_data=False)[source]

Compute the reconstruction error of X under PCA reconstruction.

More precisely, if \(X\) is an \(N \times d\) data matrix whose rows are the data points, and \(U\) is the matrix whose columns are the principal components of \(X\), then the reconstruction loss is defined as \(\frac{1}{N} \| X U U^T - X \|_F^2\).

NOTE: The reconstruction (after centering) is X @ U @ U.T because this method assumes that the rows of X are the data points while the columns of U are the principal components.

Parameters:
  • X (Tensor) – Input data tensor whose rows represent data points.

  • k (int | None) – The number of principal components onto which projection is applied.

  • center_data (bool, optional) – Indicates whether to subtract data mean prior to projecting the data into a lower-dimensional subspace, and whether to add the data mean after projecting back. Defaults to False.

Returns:

Reconstruction loss as defined above.

Return type:

float
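
As a sketch, the reconstruction error can be used to gauge how many components are needed (again assuming the module has been fit via forward so that its principal components are available):

    import torch

    from fl4health.model_bases.pca import PcaModule

    X_train = torch.randn(1000, 64)

    pca = PcaModule()
    pca(X_train, center_data=True)  # fit: computes the principal components

    # The error should shrink towards 0 as more principal components are kept.
    for k in (4, 16, 64):
        error = pca.compute_reconstruction_error(X_train, k=k, center_data=True)
        print(f"k={k:3d}  reconstruction error: {error:.4f}")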

forward(X, center_data)[source]

Perform PCA on the data matrix X by computing its SVD.

NOTE: the algorithm assumes that the rows of X are the data points (after reshaping as needed). Consequently, the principal components, which are the eigenvectors of X.T @ X, are the right singular vectors in the SVD of X.

Parameters:
  • X (Tensor) – Data matrix.

  • center_data (bool) – If true, then the data mean will be subtracted from all data points prior to performing PCA. If center_data is false, it is expected that the data has already been centered and an exception will be thrown if it is not.

Returns:

The principal components (i.e., right singular vectors) and their corresponding singular values.

Return type:

tuple[Tensor, Tensor]
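
A minimal fitting sketch (shapes are illustrative; for a reduced SVD of a 1000 x 64 data matrix one would expect 64 principal components):

    import torch

    from fl4health.model_bases.pca import PcaModule

    X_train = torch.randn(1000, 64)  # rows are data points

    pca = PcaModule()
    principal_components, singular_values = pca(X_train, center_data=True)

    # Columns of principal_components are the right singular vectors of the
    # (centered) data matrix, i.e. the eigenvectors of X.T @ X.
    print(principal_components.shape, singular_values.shape)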

maybe_reshape(X)[source]

Reshape input tensor X as needed so SVD can be computed. Reshaping is required when each data point is an N-dimensional tensor because PCA requires X to be a 2D data matrix.

Parameters:

X (Tensor) – Data matrix

Returns:

Tensor flattened to be 2D.

Return type:

Tensor
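
For example, a batch of image tensors would be flattened into one row per data point (a sketch; the exact output shape assumes the first dimension indexes the data points):

    import torch

    from fl4health.model_bases.pca import PcaModule

    pca = PcaModule()

    # A batch of 32 single-channel 28 x 28 images.
    images = torch.randn(32, 1, 28, 28)
    flat = pca.maybe_reshape(images)
    print(flat.shape)  # expected to be (32, 784): one flattened row per image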

prepare_data_forward(X, center_data)[source]

Prepare input data X for PCA by reshaping and centering it as needed.

Parameters:
  • X (Tensor) – Data matrix.

  • center_data (bool) – If true, then the data mean will be subtracted from all data points prior to performing PCA. If center_data is false, it is expected that the data has already been centered and an exception will be thrown if it is not.

Returns:

Prepared data matrix.

Return type:

Tensor

project_back(X_lower_dim, add_mean=False)[source]

Project low-dimensional principal representations back into the original space to recover the reconstruction of data points.

Parameters:
  • X_lower_dim (Tensor) – Matrix whose rows are low-dimensional principal representations of the original data.

  • add_mean (bool, optional) – Indicates whether the training data mean should be added to the projection result. This can be set to True if the user centered the data prior to dimensionality reduction and now wishes to add back the data mean. Defaults to False.

Returns:

Reconstruction of data points.

Return type:

Tensor
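
A round-trip sketch using project_lower_dim (documented below), assuming the module has been fit with centering so that a data mean is available to add back:

    import torch

    from fl4health.model_bases.pca import PcaModule

    X = torch.randn(500, 32)

    pca = PcaModule()
    pca(X, center_data=True)  # fit: computes the principal components

    # Reduce to 8 dimensions, then map back into the original 32-dimensional space.
    X_lower_dim = pca.project_lower_dim(X, k=8, center_data=True)
    X_reconstructed = pca.project_back(X_lower_dim, add_mean=True)
    print(X_lower_dim.shape, X_reconstructed.shape)  # expected: (500, 8) and (500, 32)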

project_lower_dim(X, k=None, center_data=False)[source]

Project input data X onto the top k principal components.

NOTE: The result of projection (after centering) is X @ U because this method assumes that the rows of X are the data points while the columns of U are the principal components.

Parameters:
  • X (Tensor) – Input data matrix whose rows are the data points.

  • k (int | None, optional) – The number of principal components onto which projection is done. If k is None, then all principal components will be used in the projection. Defaults to None.

  • center_data (bool, optional) – If true, then the training data mean (learned in the forward pass) will be subtracted from all data points prior to projection. If center_data is false, it is expected that the data has already been centered in this manner by the user. Defaults to False.

Returns:

Projection result.

Return type:

Tensor
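
A dimensionality-reduction sketch, e.g. producing low-dimensional features for held-out data (assuming the module was fit on the training data with centering):

    import torch

    from fl4health.model_bases.pca import PcaModule

    X_train = torch.randn(1000, 64)
    X_test = torch.randn(200, 64)

    pca = PcaModule()
    pca(X_train, center_data=True)  # fit on the training data

    # Project the test data onto the top 10 principal components learned from
    # the training data; the stored training mean is subtracted first.
    features = pca.project_lower_dim(X_test, k=10, center_data=True)
    print(features.shape)  # expected: (200, 10)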

set_data_mean(X)[source]

The primary purpose of this method is to store the mean of the training data so it can be used to center validation/test data later, if needed.

Parameters:

X (Tensor) – Data matrix

Return type:

None

set_principal_components(principal_components, singular_values)[source]
Return type:

None