fl4health.feature_alignment.tab_features_preprocessor module

class TabularFeaturesPreprocessor(tab_feature_encoder)

Bases: object

__init__(tab_feature_encoder)

TabularFeaturesPreprocessor is responsible for constructing the appropriate column transformers based on the information encoded in tab_feature_encoder. These transformers will then be applied to a pandas dataframe.

Each tabular feature, which corresponds to a column in the pandas dataframe, has its own column transformer. A default transformer is initialized for each feature based on its data type, but the user may also manually specify a transformer for any feature.

Parameters:

tab_feature_encoder (TabularFeaturesInfoEncoder) – Encodes the information necessary for constructing the column transformers.

fill_in_missing_columns(df)

Return a new DataFrame in which every column that is missing entirely is added and filled with that column’s default fill value.

Parameters:

df (pd.DataFrame) – Dataframe to be filled

Returns:

Filled dataframe

Return type:

pd.DataFrame
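
The behaviour described for this method can be illustrated with plain pandas. This is a minimal sketch under stated assumptions, not the library's implementation: the helper `fill_missing_columns` and the `default_fill_values` mapping are hypothetical names introduced here for illustration.

```python
import pandas as pd


def fill_missing_columns(df: pd.DataFrame, default_fill_values: dict) -> pd.DataFrame:
    """Hypothetical helper: add every absent column, filled with its default value."""
    filled = df.copy()  # leave the caller's dataframe untouched
    for column, fill_value in default_fill_values.items():
        if column not in filled.columns:
            # A scalar assignment is broadcast to every row of the new column.
            filled[column] = fill_value
    return filled


df = pd.DataFrame({"age": [31, 47]})
filled_df = fill_missing_columns(df, {"age": 0, "smoker": False, "city": "N/A"})
```

Columns already present in the input (here `age`) keep their values; only the absent ones are created.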

get_default_binary_pipeline()

Default binary pipeline factory. Most frequent imputer and an ordinal encoder.

Returns:

Default binary pipeline

Return type:

Pipeline
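
A pipeline of this shape can be sketched directly with scikit-learn. The step names (`imputer`, `encoder`) and the sample column are assumptions made for illustration; the library's actual pipeline may be configured differently.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Sketch: impute missing entries with the most frequent value, then encode as integers.
binary_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OrdinalEncoder()),
    ]
)

# One binary column with a missing entry (np.nan).
column = np.array([["yes"], ["no"], ["yes"], [np.nan]], dtype=object)
encoded = binary_pipeline.fit_transform(column)
```

In this sample the missing entry is imputed as `"yes"`, the most frequent value, before the two categories are mapped to integer codes.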

get_default_numeric_pipeline()

Default numeric pipeline factory. Mean imputation followed by a default min-max scaler.

Returns:

Default numeric pipeline

Return type:

Pipeline
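
As an illustration, mean imputation followed by min-max scaling can be written with scikit-learn as below. The step names and sample values are assumptions chosen for this sketch rather than details taken from the library.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Sketch: replace missing values with the column mean, then rescale to [0, 1].
numeric_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", MinMaxScaler()),
    ]
)

column = np.array([[10.0], [20.0], [np.nan], [30.0]])
scaled = numeric_pipeline.fit_transform(column)
```

The missing entry is replaced by the mean of the observed values (20.0) before scaling, so it lands at the midpoint of the scaled range.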

get_default_one_hot_pipeline(categories)

Default one-hot encoding pipeline. Unknown values are ignored, and the categories are provided as input.

Parameters:

categories (MetaData) – Categories to be one hot encoded.

Returns:

Default one-hot encoding pipeline

Return type:

Pipeline
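
The effect of ignoring unknowns with a fixed category list can be seen in a small scikit-learn sketch. The category values and step name are made up for illustration, and the exact encoder settings used by the library are not confirmed here.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categories = ["red", "green", "blue"]  # illustrative categories

# Sketch: fixed category list, unknown values encoded as an all-zero row.
one_hot_pipeline = Pipeline(
    steps=[("encoder", OneHotEncoder(handle_unknown="ignore", categories=[categories]))]
)

one_hot_pipeline.fit(np.array([["green"], ["blue"]], dtype=object))
# "purple" is not among the provided categories.
encoded = one_hot_pipeline.transform(np.array([["green"], ["purple"]], dtype=object)).toarray()
```

Because the categories are supplied up front, the output always has one column per listed category, even if some categories never appear in the fitted data.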

get_default_ordinal_pipeline(categories)

Default ordinal pipeline. Unknown values are assigned their own category; the known categories are provided as input.

Parameters:

categories (MetaData) – Categories to be used in encoding

Returns:

Default ordinal pipeline

Return type:

Pipeline
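
One way to give unknown values their own code is scikit-learn's `handle_unknown="use_encoded_value"` option. The sketch below assumes the unknown code is `len(categories)`; that choice, like the category names, is an assumption for illustration and may differ from the library's.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

categories = ["low", "medium", "high"]  # illustrative categories, in encoding order

# Sketch: known categories map to 0..2, anything else maps to the extra code 3.
ordinal_pipeline = Pipeline(
    steps=[
        (
            "encoder",
            OrdinalEncoder(
                categories=[categories],
                handle_unknown="use_encoded_value",
                unknown_value=len(categories),  # assumed code for unknown values
            ),
        )
    ]
)

ordinal_pipeline.fit(np.array([["low"], ["medium"], ["high"]], dtype=object))
encoded = ordinal_pipeline.transform(np.array([["medium"], ["extreme"]], dtype=object))
```

The unknown code must not collide with the codes of the known categories, which is why a value outside `0..len(categories) - 1` is used.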

get_default_string_pipeline(vocabulary)

Default string/text encoding pipeline. The provided vocabulary is used to instantiate a default TfidfVectorizer.

Parameters:

vocabulary (MetaData) – Vocabulary to serve as the TfidfVectorizer vocab.

Returns:

Default string/text encoding pipeline.

Return type:

Pipeline
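
To show what a fixed vocabulary means for the output, here is a minimal scikit-learn sketch. It feeds a list of strings straight into the vectorizer; how the library wires the vectorizer to a dataframe column is not shown and is not assumed here. The vocabulary and notes are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

vocabulary = ["fever", "cough", "headache"]  # illustrative vocabulary

# Sketch: only terms in the provided vocabulary produce output columns.
string_pipeline = Pipeline(steps=[("vectorizer", TfidfVectorizer(vocabulary=vocabulary))])

notes = ["fever and cough", "headache", "cough cough"]
features = string_pipeline.fit_transform(notes).toarray()
```

Each row has one entry per vocabulary term, in the order given, and words outside the vocabulary (such as "and") are dropped.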

initialize_default_pipelines(tabular_features, one_hot)

Initialize a default Pipeline for every data column in tabular_features.

Parameters:
  • tabular_features (list[TabularFeature]) – List of tabular features in the data columns.

  • one_hot (bool) – Whether or not to apply a default one-hot pipeline.

Returns:

Default feature processing pipeline per feature in the list.

Return type:

dict[str, Pipeline]
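
The per-feature selection can be pictured as a dispatch on feature type. The sketch below is an assumption-laden illustration: `FeatureSpec` is a hypothetical stand-in for TabularFeature, the `kind` strings are invented, and the idea that `one_hot` switches categorical features from ordinal to one-hot encoding is an assumed reading of the parameter, not confirmed behaviour.

```python
from dataclasses import dataclass
from typing import Optional

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, OrdinalEncoder


@dataclass
class FeatureSpec:
    """Hypothetical stand-in for TabularFeature; not the library's class."""

    name: str
    kind: str  # one of "numeric", "binary", "ordinal" (invented labels)
    categories: Optional[list] = None


def build_default_pipelines(features: list, one_hot: bool) -> dict:
    """Map each feature name to a default Pipeline chosen from its kind."""
    pipelines = {}
    for feature in features:
        if feature.kind == "numeric":
            steps = [("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())]
        elif feature.kind == "binary":
            steps = [("imputer", SimpleImputer(strategy="most_frequent")), ("encoder", OrdinalEncoder())]
        elif feature.kind == "ordinal" and one_hot:
            # Assumed meaning of one_hot: one-hot encode categorical features.
            steps = [("encoder", OneHotEncoder(handle_unknown="ignore", categories=[feature.categories]))]
        elif feature.kind == "ordinal":
            encoder = OrdinalEncoder(
                categories=[feature.categories],
                handle_unknown="use_encoded_value",
                unknown_value=len(feature.categories),
            )
            steps = [("encoder", encoder)]
        else:
            raise ValueError(f"Unsupported feature kind: {feature.kind}")
        pipelines[feature.name] = Pipeline(steps=steps)
    return pipelines


features = [
    FeatureSpec("age", "numeric"),
    FeatureSpec("smoker", "binary"),
    FeatureSpec("grade", "ordinal", ["low", "medium", "high"]),
]
one_hot_pipelines = build_default_pipelines(features, one_hot=True)
ordinal_pipelines = build_default_pipelines(features, one_hot=False)
```

The returned dictionary is keyed by column name, matching the `dict[str, Pipeline]` return type documented for this method.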

preprocess_features(df)

Preprocess the provided dataframe with the specified pipelines.

Parameters:

df (pd.DataFrame) – Dataframe to be processed.

Returns:

Resulting input and target numpy arrays after preprocessing.

Return type:

tuple[NDArray, NDArray]
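
The shape of the result, one array for inputs and one for targets, can be sketched with two scikit-learn column transformers. Splitting the work into a separate feature transformer and target transformer is an assumption made for this illustration, as are the column names and the transformers chosen.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

df = pd.DataFrame(
    {
        "age": [20.0, 40.0, 60.0],
        "weight": [50.0, 70.0, 90.0],
        "outcome": ["neg", "pos", "neg"],  # treated as the target column here
    }
)

# Assumed split: one transformer for input columns, one for target columns.
feature_transformer = ColumnTransformer(
    [("age", MinMaxScaler(), ["age"]), ("weight", MinMaxScaler(), ["weight"])],
    remainder="drop",
)
target_transformer = ColumnTransformer(
    [("outcome", OrdinalEncoder(), ["outcome"])],
    remainder="drop",
)

features = np.asarray(feature_transformer.fit_transform(df))
targets = np.asarray(target_transformer.fit_transform(df))
```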

return_column_transformer(pipelines)

Given a set of pipelines, create a set of column transformations based on those pipelines.

Parameters:

pipelines (dict[str, Pipeline]) – Dictionary of pipelines, keyed by the name of the column each pipeline applies to.

Returns:

Transformer for the specified columns. The unspecified columns are dropped.

Return type:

ColumnTransformer
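
A dictionary of per-column pipelines maps naturally onto scikit-learn's `ColumnTransformer`, with `remainder="drop"` discarding unspecified columns. The following is a sketch of that idea, not the library's code; the transformer naming scheme and sample data are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

pipelines = {
    "age": Pipeline(steps=[("scaler", MinMaxScaler())]),
    "smoker": Pipeline(steps=[("encoder", OrdinalEncoder())]),
}

# One (name, pipeline, columns) entry per dictionary key; other columns are dropped.
transformer = ColumnTransformer(
    transformers=[(f"{name}_pipeline", pipeline, [name]) for name, pipeline in pipelines.items()],
    remainder="drop",
)

df = pd.DataFrame(
    {
        "age": [20.0, 40.0, 60.0],
        "smoker": ["no", "yes", "no"],
        "patient_id": [1, 2, 3],  # no pipeline specified, so this column is dropped
    }
)
transformed = np.asarray(transformer.fit_transform(df))
```

The output has one block of columns per pipeline, in dictionary order, and `patient_id` does not appear in it.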

set_feature_pipeline(feature_name, pipeline)

This method lets the user replace the pipeline applied to a specific feature with a custom one. For example, the user may want to use different scalers for two distinct numerical features.

Parameters:
  • feature_name (str) – Name of the target column in the dataframe to which the pipeline is applied.

  • pipeline (Pipeline) – Pipeline to apply to the associated column.

Return type:

None
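
The "different scalers for two numerical features" case can be sketched with a plain dictionary of pipelines, where one entry is overridden. The dictionary stands in for the preprocessor's internal state and `default_numeric_pipeline` is a hypothetical helper; neither is the library's API.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler


def default_numeric_pipeline() -> Pipeline:
    """Hypothetical default: mean imputation followed by min-max scaling."""
    return Pipeline(steps=[("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


# Both numeric features start with the same default pipeline.
pipelines = {"age": default_numeric_pipeline(), "income": default_numeric_pipeline()}

# Override one feature with a custom pipeline that standardizes instead.
pipelines["income"] = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)

income = np.array([[30000.0], [50000.0], [70000.0]])
scaled_income = pipelines["income"].fit_transform(income)
```

After the override, `age` is still min-max scaled while `income` is standardized to zero mean and unit variance.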