fl4health.feature_alignment.tab_features_preprocessor module
- class TabularFeaturesPreprocessor(tab_feature_encoder)
Bases:
object
- __init__(tab_feature_encoder)
TabularFeaturesPreprocessor is responsible for constructing the appropriate column transformers based on the information encoded in tab_feature_encoder. These transformers will then be applied to a pandas dataframe.
Each tabular feature, which corresponds to a column in the pandas dataframe, has its own column transformer. A default transformer is initialized for each feature based on its data type, but the user may also manually specify a transformer for this feature.
- Parameters:
tab_feature_encoder (TabularFeaturesInfoEncoder) – Encodes the information necessary for constructing the column transformers.
- fill_in_missing_columns(df)
Return a new DataFrame in which every column that is missing entirely is added and filled with that column's default fill value.
- Parameters:
df (pd.DataFrame) – Dataframe to be filled
- Returns:
Filled dataframe
- Return type:
pd.DataFrame
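As a rough illustration of the behavior described above, the following standalone pandas sketch adds each absent column as a constant column. The helper name `fill_missing_columns` and the `fill_values` mapping are hypothetical and not part of fl4health; the library method presumably takes its defaults from the encoder rather than from an argument.

```python
import pandas as pd


def fill_missing_columns(df: pd.DataFrame, fill_values: dict[str, object]) -> pd.DataFrame:
    """Return a copy of df in which every absent column named in fill_values is added as a constant."""
    filled = df.copy()  # leave the caller's dataframe untouched
    for column, fill_value in fill_values.items():
        if column not in filled.columns:
            # The whole column is missing, so every row receives the default value.
            filled[column] = fill_value
    return filled


df = pd.DataFrame({"age": [34, 51]})
# "smoker" is absent from df; "age" is present and must not be overwritten.
filled = fill_missing_columns(df, {"age": 0, "smoker": "no"})
```

Existing columns are left alone in this sketch, including any missing values inside them; those would be handled later by the imputers in the per-column pipelines.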
- get_default_binary_pipeline()
Default binary pipeline factory. Most-frequent imputer followed by an ordinal encoder.
- Returns:
Default binary pipeline
- Return type:
Pipeline
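A pipeline matching this description could look like the scikit-learn sketch below. The step names and the sample data are assumptions made for illustration, and the settings fl4health actually uses may differ.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Step names ("imputer", "encoder") are illustrative only.
binary_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),  # fill gaps with the mode
        ("encoder", OrdinalEncoder()),  # map the two categories to 0.0 and 1.0
    ]
)

# Object dtype keeps the strings and the NaN in one column.
df = pd.DataFrame({"smoker": ["yes", "no", np.nan, "no"]}, dtype=object)
encoded = binary_pipeline.fit_transform(df)
```

Here the missing entry is imputed as "no", the most frequent value, and the encoder orders the categories alphabetically, so "no" becomes 0.0 and "yes" becomes 1.0.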
- get_default_numeric_pipeline()
Default numeric pipeline factory. Mean imputation followed by a min-max scaler with default settings.
- Returns:
Default numeric pipeline
- Return type:
Pipeline
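The numeric default can be sketched with scikit-learn as follows. This is a minimal example under the assumption that the two stages are `SimpleImputer` and `MinMaxScaler`; the step names are invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

numeric_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="mean")),  # replace NaN with the column mean
        ("scaler", MinMaxScaler()),  # default feature_range is (0, 1)
    ]
)

# One numeric column with a missing entry; the mean of the observed values is 5.0.
column = np.array([[0.0], [np.nan], [10.0]])
scaled = numeric_pipeline.fit_transform(column)
```

The missing value is imputed as 5.0 and the column is then rescaled to the unit interval, giving 0.0, 0.5, and 1.0.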
- get_default_one_hot_pipeline(categories)
Default one-hot encoding pipeline. Unknown values are ignored; the categories are provided as input.
- Parameters:
categories (MetaData) – Categories to be one hot encoded.
- Returns:
Default one-hot encoding pipeline
- Return type:
Pipeline
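To make "unknowns are ignored" concrete, here is a hedged scikit-learn sketch. The category list, the step name, and the sample values are assumptions for illustration; only the use of fixed categories and ignored unknowns comes from the description above.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categories = ["blue", "green", "red"]  # hypothetical known categories

one_hot_pipeline = Pipeline(
    steps=[
        # Fixed categories keep the output width stable regardless of the data seen;
        # handle_unknown="ignore" encodes unseen values as an all-zero row.
        ("encoder", OneHotEncoder(categories=[categories], handle_unknown="ignore")),
    ]
)

train = np.array([["red"], ["blue"]], dtype=object)
one_hot_pipeline.fit(train)

new_data = np.array([["red"], ["blue"], ["purple"]], dtype=object)
# The encoder returns a sparse matrix by default; densify it for inspection.
encoded = one_hot_pipeline.transform(new_data).toarray()
```

Fixing the categories up front means every caller produces the same number of columns in the same order, even if some categories never appear in its local data.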
- get_default_ordinal_pipeline(categories)
Default ordinal pipeline. Unknown values are assigned their own category; the known categories are provided as input.
- Parameters:
categories (MetaData) – Categories to be used in encoding
- Returns:
Default ordinal pipeline
- Return type:
Pipeline
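One way to give unknowns their own category with scikit-learn is `handle_unknown="use_encoded_value"`. In the sketch below, the choice of `len(categories)` as the code for unknowns is an assumption made for illustration and may not match the value fl4health uses; the categories and step name are also hypothetical.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

categories = ["low", "medium", "high"]  # hypothetical known categories, in encoding order

ordinal_pipeline = Pipeline(
    steps=[
        (
            "encoder",
            OrdinalEncoder(
                categories=[categories],
                handle_unknown="use_encoded_value",
                # Assumed code for unknowns: one past the last known category index.
                unknown_value=len(categories),
            ),
        ),
    ]
)

ordinal_pipeline.fit(np.array([["low"], ["medium"], ["high"]], dtype=object))
encoded = ordinal_pipeline.transform(np.array([["high"], ["extreme"], ["low"]], dtype=object))
```

Known values map to their position in the supplied list, and the unseen value "extreme" maps to the reserved code 3.0.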
- get_default_string_pipeline(vocabulary)
Default string/text encoding pipeline. The vocabulary is provided and is used to instantiate a default TfidfVectorizer.
- Parameters:
vocabulary (MetaData) – Vocabulary to serve as the TfidfVectorizer vocab.
- Returns:
Default string/text encoding pipeline.
- Return type:
Pipeline
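The effect of supplying a vocabulary can be sketched with scikit-learn's `TfidfVectorizer`. The vocabulary, the sample texts, and the step name below are assumptions for illustration; all other vectorizer settings are left at their defaults.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

vocabulary = ["chest", "fever", "pain"]  # hypothetical fixed vocabulary

string_pipeline = Pipeline(
    steps=[
        # With a fixed vocabulary, output columns follow the vocabulary order
        # and tokens outside the vocabulary are dropped.
        ("vectorizer", TfidfVectorizer(vocabulary=vocabulary)),
    ]
)

# The vectorizer expects a 1-D sequence of strings, not a 2-D array.
texts = ["chest pain", "mild fever", "chest fever pain"]
tfidf = string_pipeline.fit_transform(texts).toarray()
```

The token "mild" is not in the vocabulary, so the second row has weight only in the "fever" column. Rows are L2-normalized by default.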
- initialize_default_pipelines(tabular_features, one_hot)
Initialize a default Pipeline for every data column in tabular_features.
- Parameters:
tabular_features (list[TabularFeature]) – list of tabular features in the data columns.
one_hot (bool) – Whether or not to apply a default one-hot pipeline.
- Returns:
Default feature processing pipeline per feature in the list.
- Return type:
- preprocess_features(df)
Preprocess the provided dataframe with the specified pipelines.
- Parameters:
df (pd.DataFrame) – Dataframe to be processed.
- Returns:
Resulting input and target numpy arrays after preprocessing.
- Return type:
tuple[NDArray, NDArray]
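The sketch below shows one plausible way per-column pipelines could be combined and applied to a dataframe to obtain separate input and target arrays. It is not the fl4health implementation: `build_column_transformer`, `numeric_pipeline`, the column names, and the split into two transformers are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder


def numeric_pipeline() -> Pipeline:
    # Hypothetical helper: mean imputation followed by min-max scaling.
    return Pipeline([("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())])


def build_column_transformer(pipelines: dict[str, Pipeline]) -> ColumnTransformer:
    # One (name, pipeline, columns) triple per column; columns without a pipeline are dropped.
    transformers = [(f"{name}_pipeline", pipe, [name]) for name, pipe in pipelines.items()]
    return ColumnTransformer(transformers=transformers, remainder="drop")


df = pd.DataFrame(
    {"age": [20.0, np.nan, 40.0], "bmi": [18.0, 24.0, 30.0], "outcome": [0, 1, 1]}
)

# Separate transformers for feature columns and the target column (an assumption).
feature_transformer = build_column_transformer(
    {"age": numeric_pipeline(), "bmi": numeric_pipeline()}
)
target_transformer = build_column_transformer(
    {"outcome": Pipeline([("encoder", OrdinalEncoder())])}
)

x = feature_transformer.fit_transform(df)  # shape (3, 2)
y = target_transformer.fit_transform(df)  # shape (3, 1)
```

Passing each column as a one-element list keeps the input to every pipeline two-dimensional, which is what the imputers, scalers, and encoders used here expect.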
- return_column_transformer(pipelines)
Given a set of pipelines, create a set of column transformations based on those pipelines.