Prolonged Length of Stay Prediction#

This notebook showcases length of stay prediction on the Synthea dataset using CyclOps. The task is formulated as a binary classification task, where we predict the probability that a patient will stay 7 days or longer.

To generate the synthetic patient data:

  1. Generate Synthea data using one of the official Synthea releases. We used v3.0.0.

  2. Follow the instructions provided in ETL-Synthea to load the CSV data into a PostgreSQL database.

Import Libraries#

[1]:
"""Prolonged Length of Stay Prediction."""

import copy
import shutil
from datetime import date

import cycquery.ops as qo
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from cycquery import DatasetQuerier
from datasets import Dataset
from datasets.features import ClassLabel
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

from cyclops.data.df.feature import TabularFeatures
from cyclops.data.slicer import SliceSpec
from cyclops.evaluate.fairness import FairnessConfig  # noqa: E402
from cyclops.evaluate.metrics import create_metric
from cyclops.evaluate.metrics.experimental.metric_dict import MetricDict
from cyclops.models.catalog import create_model
from cyclops.report import ModelCardReport
from cyclops.report.plot.classification import ClassificationPlotter
from cyclops.report.utils import flatten_results_dict
from cyclops.tasks import BinaryTabularClassificationTask

CyclOps offers a package for documentation of the model through a model report. The ModelCardReport class is used to populate and generate the model report as an HTML file. The model report has the following sections:

  • Overview: Provides a high level overview of how the model is doing (a quick glance of important metrics), and how it is doing over time (performance over several metrics and subgroups over time).

  • Datasets: High level statistics of the training data, including changes in distribution over time.

  • Quantitative Analysis: This section contains additional detailed performance metrics of the model for different sets of the data and subpopulations.

  • Fairness Analysis: This section contains the fairness metrics of the model.

  • Model Details: This section contains descriptive metadata about the model such as the owners, version, license, etc.

  • Model Parameters: This section contains the technical details of the model such as the model architecture, training parameters, etc.

  • Considerations: This section contains descriptions of the considerations involved in developing and using the model such as the intended use, limitations, etc.

We will use this to document the model development process as we go along and generate the model report at the end.

The model report tool is a work in progress and is subject to change.

[2]:
report = ModelCardReport()

Constants#

[3]:
NAN_THRESHOLD = 0.25
NUM_DAYS = 7
TRAIN_SIZE = 0.8
RANDOM_SEED = 85

Data Querying#

Compute length of stay (labels)#

  1. Get encounters, compute length of stay.

  2. Filter out encounters shorter than 2 days or longer than 20 days.
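The two steps above can be sketched without a database. This is only an illustration: it assumes length of stay is counted in whole days between an encounter's start and stop timestamps, and the records and the helper `length_of_stay_days` are made up rather than taken from cycquery.

```python
from datetime import datetime

# Hypothetical (encounter_id, start, stop) records; the real ones come from Synthea.
encounters = [
    ("a", datetime(2020, 1, 1), datetime(2020, 1, 1, 6)),  # same-day visit
    ("b", datetime(2020, 1, 1), datetime(2020, 1, 6)),     # 5 days
    ("c", datetime(2020, 1, 1), datetime(2020, 1, 11)),    # 10 days
    ("d", datetime(2020, 1, 1), datetime(2020, 2, 15)),    # 45 days
]


def length_of_stay_days(start, stop):
    """Return the length of stay in whole days (assumed unit)."""
    return (stop - start).days


# Keep stays strictly longer than 1 day and strictly shorter than 21 days,
# mirroring the greater-than and less-than conditions used in the query.
cohort_sketch = [
    {"encounter_id": encounter_id, "los": length_of_stay_days(start, stop)}
    for encounter_id, start, stop in encounters
    if 1 < length_of_stay_days(start, stop) < 21
]
print(cohort_sketch)
```

Only encounters "b" and "c" survive the filter in this toy data.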

[4]:
querier = DatasetQuerier(
    dbms="postgresql",
    port=5432,
    host="localhost",
    database="synthea_demo",
    user="postgres",
    password="pwd",
)


def get_encounters():
    """Get encounters data."""
    patients = querier.native.patients()
    ops = qo.Sequential(
        qo.Rename({"id": "patient_id"}),
        qo.Keep(["patient_id", "birthdate", "gender", "race", "ethnicity"]),
    )
    patients = patients.ops(ops)
    encounters = querier.native.encounters()
    patient_encounters = encounters.join(
        patients,
        on=("patient", "patient_id"),
        isouter=True,
    )
    ops = qo.Sequential(
        qo.Rename({"id": "encounter_id"}),
        qo.ExtractTimestampComponent("start", "year", "start_year"),
        qo.ExtractTimestampComponent("birthdate", "year", "birthdate_year"),
        qo.AddColumn(
            "start_year",
            "birthdate_year",
            new_col_labels="age",
            negative=True,
        ),
        qo.AddColumn("stop", "start", new_col_labels="los", negative=True),
        qo.ConditionGreaterThan("los", 1),
        qo.ConditionLessThan("los", 21),
        qo.Keep(
            [
                "encounter_id",
                "los",
                "age",
                "gender",
            ],
        ),
    )
    return patient_encounters.ops(ops)


def get_observations(cohort):
    """Get observations data."""
    observations = querier.native.observations()
    ops = qo.Sequential(
        qo.ConditionIn(
            "category",
            [
                "laboratory",
                "vital-signs",
            ],
        ),
        qo.ConditionEquals("type", "numeric"),
    )
    observations = observations.ops(ops)
    cohort = cohort.join(
        observations,
        on=("encounter_id", "encounter"),
        isouter=True,
    )
    groupby_op = qo.GroupByAggregate(
        "encounter_id",
        {"description": ("count", "n_obs")},
    )
    observations = cohort.run()
    observations_count = cohort.ops(groupby_op).run()
    observations_stats = observations.pivot_table(
        index="encounter_id",
        columns="description",
        values="value",
        aggfunc="max",
    ).add_prefix("obs_")

    return [observations_count, observations_stats]


def get_medications(cohort):
    """Get medications data."""
    medications = querier.native.medications()
    cohort = cohort.join(
        medications,
        on=("encounter_id", "encounter"),
    )
    groupby_op = qo.GroupByAggregate(
        "encounter_id",
        {"description": ("count", "n_meds")},
    )

    return cohort.ops(groupby_op).run()


def get_procedures(cohort):
    """Get procedures data."""
    procedures = querier.native.procedures()
    cohort = cohort.join(
        procedures,
        on=("encounter_id", "encounter"),
    )
    groupby_op = qo.GroupByAggregate(
        "encounter_id",
        {"description": ("count", "n_procedures")},
    )

    return cohort.ops(groupby_op).run()


def run_query():
    """Run query pipeline."""
    cohort_query = get_encounters()
    to_merge = []
    observations = get_observations(cohort_query)
    to_merge.extend(observations)
    medications = get_medications(cohort_query)
    to_merge.append(medications)
    procedures = get_procedures(cohort_query)
    to_merge.append(procedures)
    cohort = cohort_query.run()
    for to_merge_df in to_merge:
        cohort = cohort.merge(
            to_merge_df,
            on="encounter_id",
            how="left",
        )

    return cohort


cohort = run_query()
2024-07-16 17:42:08,280 INFO cycquery.orm    - Database setup, ready to run queries!
2024-07-16 17:43:27,641 INFO cycquery.orm    - Query returned successfully!
2024-07-16 17:43:27,644 INFO cycquery.utils.profile - Finished executing function run_query in 78.232933 s
2024-07-16 17:43:32,803 INFO cycquery.orm    - Query returned successfully!
2024-07-16 17:43:32,805 INFO cycquery.utils.profile - Finished executing function run_query in 5.158978 s
2024-07-16 17:43:43,033 INFO cycquery.orm    - Query returned successfully!
2024-07-16 17:43:43,035 INFO cycquery.utils.profile - Finished executing function run_query in 6.230752 s
2024-07-16 17:43:50,774 INFO cycquery.orm    - Query returned successfully!
2024-07-16 17:43:50,777 INFO cycquery.utils.profile - Finished executing function run_query in 7.735571 s
2024-07-16 17:43:50,967 INFO cycquery.orm    - Query returned successfully!
2024-07-16 17:43:50,969 INFO cycquery.utils.profile - Finished executing function run_query in 0.191167 s

Data Inspection and Preprocessing#

Drop columns with too many NaNs based on NAN_THRESHOLD#

[5]:
null_counts = cohort.isnull().sum()[cohort.isnull().sum() > 0]
fig = go.Figure(data=[go.Bar(x=null_counts.index, y=null_counts.values)])
fig.update_layout(
    title="Number of Null Values per Column",
    xaxis_title="Columns",
    yaxis_title="Number of Null Values",
)
fig.show()

Add the figure to the report

We can use the log_plotly_figure method to add the figure to a section of the report. The interactive parameter (True by default) controls whether the figure is embedded as an interactive plot. Interactive figures make the final report larger than static ones.

[6]:
report.log_plotly_figure(
    fig=fig,
    caption="Number of Null Values per Column",
    section_name="datasets",
    interactive=True,
)
[7]:
thresh_nan = int(NAN_THRESHOLD * len(cohort))
cohort = cohort.dropna(axis=1, thresh=thresh_nan)
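Note that the `thresh` argument of pandas `DataFrame.dropna` is the minimum number of non-null values a column needs in order to be kept. With `NAN_THRESHOLD = 0.25`, a column therefore survives if at least 25% of its values are present. The same rule can be sketched in plain Python on a made-up table (the column names and values are illustrative only):

```python
NAN_THRESHOLD = 0.25

# Toy table with 8 rows; None marks a missing value.
table = {
    "age": [30, 41, 52, 63, 74, 85, 29, 38],  # 8 non-null values
    "obs_mostly_present": [1.0, None, 2.0, 3.0, None, 4.0, 5.0, 6.0],  # 6 non-null
    "obs_rare": [None, None, None, 7.5, None, None, None, None],  # 1 non-null
}

n_rows = len(table["age"])
min_non_null = int(NAN_THRESHOLD * n_rows)  # 2 non-null values required here

# Keep a column only if it has at least `min_non_null` observed values.
kept_columns = [
    name
    for name, values in table.items()
    if sum(value is not None for value in values) >= min_non_null
]
print(kept_columns)
```

In this toy table only `obs_rare` is dropped, because it has a single observed value.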

Length of stay distribution#

[8]:
length_of_stay = cohort["los"]
length_of_stay_counts = list(length_of_stay.value_counts().values)
length_of_stay_keys = list(length_of_stay.value_counts().keys())
cohort["outcome"] = cohort["los"] < NUM_DAYS
fig = go.Figure(data=[go.Bar(x=length_of_stay_keys, y=length_of_stay_counts)])
fig.update_layout(
    title="Length of stay",
    xaxis_title="Days",
    yaxis_title="Number of encounters",
)
fig.show()

Add the figure to the report

[9]:
report.log_plotly_figure(
    fig=fig,
    caption="Length of stay distribution",
    section_name="datasets",
)

Outcome distribution#

[10]:
cohort["outcome"] = cohort["outcome"].astype("int")
fig = px.pie(cohort, names="outcome")
fig.update_traces(textinfo="percent+label")
fig.update_layout(title_text="Outcome Distribution")
fig.update_traces(
    hovertemplate=(
        "Outcome: %{label}<br>Count: %{value}<br>"
        "Percent: %{percent}"
    ),
)
fig.show()

Add the figure to the report

[11]:
report.log_plotly_figure(
    fig=fig,
    caption="Outcome Distribution",
    section_name="datasets",
)
[12]:
class_counts = cohort["outcome"].value_counts()
class_ratio = class_counts[0] / class_counts[1]
print(class_ratio)
0.5573997233748271

Gender distribution#

[13]:
fig = px.pie(cohort, names="gender")
fig.update_layout(
    title="Gender Distribution",
)
fig.show()

Add the figure to the report

[14]:
report.log_plotly_figure(
    fig=fig,
    caption="Gender Distribution",
    section_name="datasets",
)

Age distribution#

[15]:
fig = px.histogram(cohort, x="age")
fig.update_layout(
    title="Age Distribution",
    xaxis_title="Age",
    yaxis_title="Count",
    bargap=0.2,
)
fig.show()

Add the figure to the report

[16]:
report.log_plotly_figure(
    fig=fig,
    caption="Age Distribution",
    section_name="datasets",
)

Identifying feature types#

The CyclOps TabularFeatures class helps identify feature types, an essential step before preprocessing the data. Understanding feature types (numerical/categorical/binary) allows us to apply appropriate preprocessing steps for each type.

[17]:
features_list = [
    "age",
    "gender",
    "n_obs",
    "n_meds",
    "n_procedures",
    "obs_Alanine aminotransferase [Enzymatic activity/volume] in Serum or Plasma",
    "obs_Albumin [Mass/volume] in Serum or Plasma",
    "obs_Alkaline phosphatase [Enzymatic activity/volume] in Serum or Plasma",
    "obs_Aspartate aminotransferase [Enzymatic activity/volume] in Serum or Plasma",
    "obs_Bilirubin.total [Mass/volume] in Serum or Plasma",
    "obs_Body Weight",
    "obs_Calcium [Mass/volume] in Serum or Plasma",
    "obs_Carbon dioxide  total [Moles/volume] in Serum or Plasma",
    "obs_Chloride [Moles/volume] in Serum or Plasma",
    "obs_Creatinine [Mass/volume] in Serum or Plasma",
    "obs_Diastolic Blood Pressure",
    "obs_Erythrocyte distribution width [Ratio] by Automated count",
    "obs_Erythrocytes [#/volume] in Blood by Automated count",
    "obs_Ferritin [Mass/volume] in Serum or Plasma",
    "obs_Glomerular filtration rate/1.73 sq M.predicted",
    "obs_Glucose [Mass/volume] in Serum or Plasma",
    "obs_Hematocrit [Volume Fraction] of Blood by Automated count",
    "obs_Hemoglobin [Mass/volume] in Blood",
    "obs_Leukocytes [#/volume] in Blood by Automated count",
    "obs_MCH [Entitic mass] by Automated count",
    "obs_MCHC [Mass/volume] by Automated count",
    "obs_MCV [Entitic volume] by Automated count",
    "obs_Oxygen saturation in Arterial blood",
    "obs_Platelets [#/volume] in Blood by Automated count",
    "obs_Potassium [Moles/volume] in Serum or Plasma",
    "obs_Protein [Mass/volume] in Serum or Plasma",
    "obs_Sodium [Moles/volume] in Serum or Plasma",
    "obs_Systolic Blood Pressure",
    "obs_Troponin I.cardiac [Mass/volume] in Serum or Plasma by High sensitivity method",  # noqa: E501
    "obs_Urea nitrogen [Mass/volume] in Serum or Plasma",
]
features_list = sorted(features_list)
tab_features = TabularFeatures(
    data=cohort.reset_index(),
    features=features_list,
    by="encounter_id",
    targets="outcome",
)
print(tab_features.types)
{'outcome': 'binary', 'obs_Creatinine [Mass/volume] in Serum or Plasma': 'numeric', 'obs_Potassium [Moles/volume] in Serum or Plasma': 'numeric', 'obs_Body Weight': 'numeric', 'obs_Diastolic Blood Pressure': 'numeric', 'obs_MCV [Entitic volume] by Automated count': 'numeric', 'obs_Chloride [Moles/volume] in Serum or Plasma': 'numeric', 'obs_Hemoglobin [Mass/volume] in Blood': 'numeric', 'obs_Bilirubin.total [Mass/volume] in Serum or Plasma': 'numeric', 'obs_Oxygen saturation in Arterial blood': 'numeric', 'obs_Hematocrit [Volume Fraction] of Blood by Automated count': 'numeric', 'obs_Leukocytes [#/volume] in Blood by Automated count': 'numeric', 'obs_MCHC [Mass/volume] by Automated count': 'numeric', 'obs_Calcium [Mass/volume] in Serum or Plasma': 'numeric', 'obs_Ferritin [Mass/volume] in Serum or Plasma': 'numeric', 'obs_Albumin [Mass/volume] in Serum or Plasma': 'numeric', 'obs_Glomerular filtration rate/1.73 sq M.predicted': 'numeric', 'obs_Erythrocytes [#/volume] in Blood by Automated count': 'numeric', 'obs_Alanine aminotransferase [Enzymatic activity/volume] in Serum or Plasma': 'numeric', 'obs_Aspartate aminotransferase [Enzymatic activity/volume] in Serum or Plasma': 'numeric', 'obs_Alkaline phosphatase [Enzymatic activity/volume] in Serum or Plasma': 'numeric', 'obs_Troponin I.cardiac [Mass/volume] in Serum or Plasma by High sensitivity method': 'numeric', 'obs_Carbon dioxide  total [Moles/volume] in Serum or Plasma': 'numeric', 'n_obs': 'numeric', 'obs_Urea nitrogen [Mass/volume] in Serum or Plasma': 'numeric', 'obs_Platelets [#/volume] in Blood by Automated count': 'numeric', 'gender': 'binary', 'n_procedures': 'numeric', 'age': 'numeric', 'obs_Protein [Mass/volume] in Serum or Plasma': 'numeric', 'obs_Erythrocyte distribution width [Ratio] by Automated count': 'numeric', 'obs_Glucose [Mass/volume] in Serum or Plasma': 'numeric', 'obs_Sodium [Moles/volume] in Serum or Plasma': 'numeric', 'obs_MCH [Entitic mass] by Automated count': 'numeric', 'n_meds': 
'ordinal', 'obs_Systolic Blood Pressure': 'numeric'}
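In the output above most features are typed as numeric, `gender` and `outcome` as binary, and `n_meds` as ordinal. The rules CyclOps applies are not shown in this notebook. The following is only a rough sketch of how such a classification could work; the helper `infer_type` and its `max_ordinal_levels` cutoff are hypothetical and are not the library's logic.

```python
def infer_type(values, max_ordinal_levels=10):
    """Guess a coarse feature type from observed values (illustrative heuristic)."""
    observed = {value for value in values if value is not None}
    if len(observed) <= 2:
        return "binary"
    if all(isinstance(value, str) for value in observed):
        return "categorical"
    # A few distinct whole numbers are treated as ordinal levels.
    whole_numbers = all(float(value).is_integer() for value in observed)
    if whole_numbers and len(observed) <= max_ordinal_levels:
        return "ordinal"
    return "numeric"


toy_columns = {
    "gender": ["M", "F", "F", "M"],
    "n_meds": [1, 2, 3, 1, 2],
    "obs_Body Weight": [70.5, 82.1, 64.0, None],
}
toy_types = {name: infer_type(values) for name, values in toy_columns.items()}
print(toy_types)
```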

Creating data preprocessors#

We create a data preprocessor using sklearn’s ColumnTransformer. This helps in applying different preprocessing steps to different columns in the dataframe. For instance, binary features might be processed differently from numeric features.

[18]:
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())],
)

binary_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="most_frequent"))],
)
[19]:
numeric_features = sorted(tab_features.features_by_type("numeric"))
numeric_indices = [
    cohort[features_list].columns.get_loc(column) for column in numeric_features
]
print(numeric_features)
['age', 'n_obs', 'n_procedures', 'obs_Alanine aminotransferase [Enzymatic activity/volume] in Serum or Plasma', 'obs_Albumin [Mass/volume] in Serum or Plasma', 'obs_Alkaline phosphatase [Enzymatic activity/volume] in Serum or Plasma', 'obs_Aspartate aminotransferase [Enzymatic activity/volume] in Serum or Plasma', 'obs_Bilirubin.total [Mass/volume] in Serum or Plasma', 'obs_Body Weight', 'obs_Calcium [Mass/volume] in Serum or Plasma', 'obs_Carbon dioxide  total [Moles/volume] in Serum or Plasma', 'obs_Chloride [Moles/volume] in Serum or Plasma', 'obs_Creatinine [Mass/volume] in Serum or Plasma', 'obs_Diastolic Blood Pressure', 'obs_Erythrocyte distribution width [Ratio] by Automated count', 'obs_Erythrocytes [#/volume] in Blood by Automated count', 'obs_Ferritin [Mass/volume] in Serum or Plasma', 'obs_Glomerular filtration rate/1.73 sq M.predicted', 'obs_Glucose [Mass/volume] in Serum or Plasma', 'obs_Hematocrit [Volume Fraction] of Blood by Automated count', 'obs_Hemoglobin [Mass/volume] in Blood', 'obs_Leukocytes [#/volume] in Blood by Automated count', 'obs_MCH [Entitic mass] by Automated count', 'obs_MCHC [Mass/volume] by Automated count', 'obs_MCV [Entitic volume] by Automated count', 'obs_Oxygen saturation in Arterial blood', 'obs_Platelets [#/volume] in Blood by Automated count', 'obs_Potassium [Moles/volume] in Serum or Plasma', 'obs_Protein [Mass/volume] in Serum or Plasma', 'obs_Sodium [Moles/volume] in Serum or Plasma', 'obs_Systolic Blood Pressure', 'obs_Troponin I.cardiac [Mass/volume] in Serum or Plasma by High sensitivity method', 'obs_Urea nitrogen [Mass/volume] in Serum or Plasma']
[20]:
binary_features = sorted(tab_features.features_by_type("binary"))
binary_features.remove("outcome")
binary_indices = [
    cohort[features_list].columns.get_loc(column) for column in binary_features
]
print(binary_features)
['gender']
[21]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_indices),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), binary_indices),
    ],
    remainder="passthrough",
)
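To make the numeric pipeline concrete, here is a dependency-free sketch of what mean imputation followed by min-max scaling does to a single column. It is not scikit-learn's implementation: `impute_mean` and `min_max_scale` are hypothetical helpers, and unlike a fitted pipeline they compute their statistics on the same data they transform.

```python
def impute_mean(column):
    """Replace missing values (None) with the mean of the observed values."""
    present = [value for value in column if value is not None]
    mean = sum(present) / len(present)
    return [mean if value is None else value for value in column]


def min_max_scale(column):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(column), max(column)
    if high == low:
        # A constant column carries no spread; map everything to 0.
        return [0.0 for _ in column]
    return [(value - low) / (high - low) for value in column]


raw = [10.0, None, 20.0, 30.0]
imputed = impute_mean(raw)
scaled = min_max_scale(imputed)
print(imputed, scaled)
```

In the real pipeline the imputation mean and the scaling range are learned from the training split and then reused on the test split, which avoids leaking test statistics into preprocessing.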

Creating Hugging Face Dataset#

We convert our processed pandas dataframe into a Hugging Face dataset, a powerful and easy-to-use data format that is also compatible with the CyclOps model and evaluator modules. The dataset is then split into train and test sets.

[22]:
cohort = cohort.drop(columns=["encounter_id", "los"])
dataset = Dataset.from_pandas(cohort)
dataset.cleanup_cache_files()
[22]:
0
[23]:
dataset = dataset.cast_column("outcome", ClassLabel(num_classes=2))
dataset = dataset.train_test_split(
    train_size=TRAIN_SIZE,
    stratify_by_column="outcome",
    seed=RANDOM_SEED,
)
Casting the dataset: 100%|██████████| 1126/1126 [00:00<00:00, 39943.73 examples/s]
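Stratifying by the outcome column keeps the class proportions roughly equal in the train and test sets. A minimal sketch of the idea follows, assuming a simple per-class shuffle and cut. The helper `stratified_split` is hypothetical and is not how the Hugging Face `datasets` library implements `train_test_split`.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_size, seed):
    """Split row indices so each class keeps about the same train fraction."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    train, test = [], []
    for label in sorted(indices_by_class):
        indices = indices_by_class[label]
        rng.shuffle(indices)
        cut = round(train_size * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Toy labels: 40 negatives and 60 positives.
labels = [0] * 40 + [1] * 60
train_idx, test_idx = stratified_split(labels, train_size=0.8, seed=85)
print(len(train_idx), len(test_idx))
```

Both splits keep the 40/60 class balance of the toy labels: the test set holds 8 negatives and 12 positives.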


Model Creation#

The CyclOps model registry allows for straightforward creation and selection of models. This registry maintains a list of pre-configured models, which can be instantiated with a single line of code. Here we use an XGBoost classifier with a logistic objective (binary:logistic). Model configurations can be passed to create_model as keyword arguments that follow the parameters of XGBClassifier.

[24]:
model_name = "xgb_classifier"
model = create_model(model_name, random_state=123)

Task Creation#

We use CyclOps tasks to define our model’s task (in this case, BinaryTabularClassificationTask), train the model, make predictions, and evaluate performance. CyclOps task classes encapsulate the entire ML pipeline in a single, cohesive structure, making the process smooth and easy to manage.

[25]:
los_task = BinaryTabularClassificationTask(
    {model_name: model},
    task_features=features_list,
    task_target="outcome",
)
los_task.list_models()
[25]:
['xgb_classifier']

Training#

If best_model_params is passed to the train method, a hyperparameter search is run and the best model found is selected. Each list in best_model_params gives the candidate values used to build the parameter grid, while the "method" entry selects the search strategy ("random" here).

Note that the data preprocessor needs to be passed to the task’s methods if the Hugging Face dataset is not already preprocessed.
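To get a feel for the size of the search space used in the next cell, the sketch below enumerates the full grid and draws a random subset of candidates. It assumes that "random" means sampling configurations from the grid rather than trying all of them. The number of sampled candidates (10) and the seed are arbitrary choices for illustration, and scoring each candidate with cross-validation is omitted.

```python
import itertools
import random

param_grid = {
    "n_estimators": [100, 250, 500],
    "learning_rate": [0.1, 0.01],
    "max_depth": [2, 5],
    "reg_lambda": [0, 1, 10],
    "colsample_bytree": [0.7, 0.8, 1],
    "gamma": [0, 1, 2, 10],
}

# Every combination of one value per parameter.
names = list(param_grid)
all_combinations = [
    dict(zip(names, values)) for values in itertools.product(*param_grid.values())
]
print(len(all_combinations))

# A random search evaluates only a sample of the grid.
rng = random.Random(0)
candidates = rng.sample(all_combinations, k=10)
print(candidates[0])
```

The full grid has 3 × 2 × 2 × 3 × 3 × 4 = 432 combinations, which is why sampling is attractive when each fit is expensive.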

[26]:
best_model_params = {
    "n_estimators": [100, 250, 500],
    "learning_rate": [0.1, 0.01],
    "max_depth": [2, 5],
    "reg_lambda": [0, 1, 10],
    "colsample_bytree": [0.7, 0.8, 1],
    "gamma": [0, 1, 2, 10],
    "method": "random",
}
los_task.train(
    dataset["train"],
    model_name=model_name,
    transforms=preprocessor,
    best_model_params=best_model_params,
)
2024-07-16 17:50:23,423 INFO cyclops.models.wrappers.sk_model - Best reg_lambda: 10
2024-07-16 17:50:23,425 INFO cyclops.models.wrappers.sk_model - Best n_estimators: 250
2024-07-16 17:50:23,426 INFO cyclops.models.wrappers.sk_model - Best max_depth: 5
2024-07-16 17:50:23,428 INFO cyclops.models.wrappers.sk_model - Best learning_rate: 0.1
2024-07-16 17:50:23,429 INFO cyclops.models.wrappers.sk_model - Best gamma: 2
2024-07-16 17:50:23,430 INFO cyclops.models.wrappers.sk_model - Best colsample_bytree: 0.8
[26]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=0.8, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=2, gpu_id=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.1, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=5,
              max_leaves=None, min_child_weight=3, missing=nan,
              monotone_constraints=None, n_estimators=250, n_jobs=None,
              num_parallel_tree=None, predictor=None, random_state=123, ...)
[27]:
model_params = los_task.list_models_params()[model_name]
print(model_params)
{'objective': 'binary:logistic', 'use_label_encoder': None, 'base_score': None, 'booster': None, 'callbacks': None, 'colsample_bylevel': None, 'colsample_bynode': None, 'colsample_bytree': 0.8, 'early_stopping_rounds': None, 'enable_categorical': False, 'eval_metric': 'logloss', 'feature_types': None, 'gamma': 2, 'gpu_id': None, 'grow_policy': None, 'importance_type': None, 'interaction_constraints': None, 'learning_rate': 0.1, 'max_bin': None, 'max_cat_threshold': None, 'max_cat_to_onehot': None, 'max_delta_step': None, 'max_depth': 5, 'max_leaves': None, 'min_child_weight': 3, 'missing': nan, 'monotone_constraints': None, 'n_estimators': 250, 'n_jobs': None, 'num_parallel_tree': None, 'predictor': None, 'random_state': 123, 'reg_alpha': None, 'reg_lambda': 10, 'sampling_method': None, 'scale_pos_weight': None, 'subsample': None, 'tree_method': None, 'validate_parameters': None, 'verbosity': None, 'seed': 123}

Log the model parameters to the report.

We can add model parameters to the model card using the log_model_parameters method.

[28]:
report.log_model_parameters(params=model_params)

Prediction#

The prediction output can be either the whole Hugging Face dataset with the prediction columns added to it or the single column containing the predicted values.

[29]:
y_pred = los_task.predict(
    dataset["test"],
    model_name=model_name,
    transforms=preprocessor,
    proba=False,
    only_predictions=True,
)
print(len(y_pred))
Map: 100%|██████████| 226/226 [00:00<00:00, 964.13 examples/s]

226

Evaluation#

Evaluation uses metrics that give different perspectives on the model’s predictive abilities, namely standard performance metrics and fairness metrics.

The standard performance metrics can be created using the MetricDict object.

[30]:
metric_names = [
    "binary_accuracy",
    "binary_precision",
    "binary_recall",
    "binary_f1_score",
    "binary_auroc",
    "binary_roc_curve",
    "binary_precision_recall_curve",
    "binary_confusion_matrix",
]
metrics = [
    create_metric(metric_name, experimental=True) for metric_name in metric_names
]
metric_collection = MetricDict(metrics)
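For reference, the scalar metrics in this list all derive from the four confusion-matrix counts. The sketch below computes them on made-up labels and predictions. The helper `binary_metrics` is hypothetical and only illustrates the standard definitions; it is not the CyclOps implementation.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }


y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 0, 0, 1, 1]
toy_metrics = binary_metrics(y_true, y_hat)
print(toy_metrics)
```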

In addition to overall metrics, it might be interesting to see how the model performs on certain subpopulations. We can define these subpopulations using SliceSpec objects.

[31]:
spec_list = [
    {
        "age": {
            "min_value": 20,
            "max_value": 50,
            "min_inclusive": True,
            "max_inclusive": False,
        },
    },
    {
        "age": {
            "min_value": 50,
            "max_value": 80,
            "min_inclusive": True,
            "max_inclusive": False,
        },
    },
    {"gender": {"value": "M"}},
    {"gender": {"value": "F"}},
]
slice_spec = SliceSpec(spec_list)
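The two age slices are half-open intervals: the lower bound is included and the upper bound is excluded, so a 50-year-old falls in the second slice only. A small sketch of that membership test follows; `in_range` is a hypothetical helper that mirrors the four keys of the specification and is not the SliceSpec implementation.

```python
def in_range(value, min_value, max_value, min_inclusive=True, max_inclusive=True):
    """Check whether value lies in the interval with the given bound types."""
    above = value >= min_value if min_inclusive else value > min_value
    below = value <= max_value if max_inclusive else value < max_value
    return above and below


ages = [19, 20, 35, 49, 50, 79, 80]

# Same bounds as the two age slices: [20, 50) and [50, 80).
slice_20_50 = [
    age for age in ages if in_range(age, 20, 50, min_inclusive=True, max_inclusive=False)
]
slice_50_80 = [
    age for age in ages if in_range(age, 50, 80, min_inclusive=True, max_inclusive=False)
]
print(slice_20_50, slice_50_80)
```

Ages 19 and 80 fall outside both slices, which is worth remembering when slice-level sample counts do not add up to the size of the test set.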

A MetricDict can also be defined for the fairness metrics.

[32]:
specificity = create_metric(metric_name="binary_specificity", experimental=True)
sensitivity = create_metric(metric_name="binary_sensitivity", experimental=True)
# Reflected subtraction (1 - metric) is not supported due to limitations in
# the array API standard, so each rate is written as -metric + 1.
fpr = -specificity + 1
fnr = -sensitivity + 1
ber = (fpr + fnr) / 2
fairness_metric_collection = MetricDict(
    {
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "BER": ber,
    },
)
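The balanced error rate (BER) built above is the average of the false positive rate and the false negative rate. A worked example on made-up confusion-matrix counts, using a hypothetical helper `balanced_error_rate`:

```python
def balanced_error_rate(tp, fp, tn, fn):
    """Average of the false positive rate and the false negative rate."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    fpr = 1 - specificity
    fnr = 1 - sensitivity
    return (fpr + fnr) / 2


# Toy counts: 50 positives (30 found) and 50 negatives (40 correctly rejected).
ber_value = balanced_error_rate(tp=30, fp=10, tn=40, fn=20)
print(round(ber_value, 3))
```

Here sensitivity is 0.6 and specificity is 0.8, so the BER is (0.2 + 0.4) / 2 = 0.3. Unlike plain accuracy, it weights errors on both classes equally regardless of class imbalance.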

The FairnessConfig helps in setting up and evaluating the fairness of the model predictions.

[33]:
fairness_config = FairnessConfig(
    metrics=fairness_metric_collection,
    dataset=None,  # dataset is passed from the evaluator
    target_columns=None,  # target columns are passed from the evaluator
    groups=["gender", "age"],
    group_bins={"age": [20, 40]},
    group_base_values={"age": 40, "gender": "M"},
    thresholds=[0.5],
)
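Judging from the filter names in the evaluation log, the bin edges `[20, 40]` split age into three right-closed intervals: (-inf, 20], (20, 40] and (40, inf]. A sketch of that assignment follows, assuming right-closed bins; `bin_for` is a hypothetical helper and not part of the CyclOps API.

```python
import bisect
import math


def bin_for(value, edges):
    """Return the (low, high] interval that contains value."""
    bounds = [-math.inf, *edges, math.inf]
    # bisect_left puts a value equal to an edge into the bin ending at that edge.
    position = bisect.bisect_left(edges, value)
    return bounds[position], bounds[position + 1]


age_edges = [20, 40]
for age in (20, 21, 40, 41):
    print(age, bin_for(age, age_edges))
```

Under this assumption an age of exactly 20 lands in the lowest bin and an age of exactly 40 in the middle one.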

The evaluate method outputs the evaluation results and the Hugging Face dataset with the predictions added to it.

[34]:
results, dataset_with_preds = los_task.evaluate(
    dataset["test"],
    metric_collection,
    model_names=model_name,
    transforms=preprocessor,
    prediction_column_prefix="preds",
    slice_spec=slice_spec,
    batch_size=-1,
    fairness_config=fairness_config,
    override_fairness_metrics=False,
)
Map: 100%|██████████| 226/226 [00:00<00:00, 971.47 examples/s]
Flattening the indices: 100%|██████████| 226/226 [00:00<00:00, 840.56 examples/s]
Flattening the indices: 100%|██████████| 226/226 [00:00<00:00, 30888.71 examples/s]
Map: 100%|██████████| 226/226 [00:00<00:00, 5302.95 examples/s]
Filter -> age:[20 - 50): 100%|██████████| 226/226 [00:00<00:00, 16173.50 examples/s]
Filter -> age:[50 - 80): 100%|██████████| 226/226 [00:00<00:00, 11568.66 examples/s]
Filter -> gender:M: 100%|██████████| 226/226 [00:00<00:00, 11637.83 examples/s]
Filter -> gender:F: 100%|██████████| 226/226 [00:00<00:00, 12320.80 examples/s]
Filter -> overall: 100%|██████████| 226/226 [00:00<00:00, 12628.06 examples/s]
Filter -> gender:F&age:(-inf - 20.0]: 100%|██████████| 226/226 [00:00<00:00, 15639.80 examples/s]
Filter -> gender:F&age:(20.0 - 40.0]: 100%|██████████| 226/226 [00:00<00:00, 16129.74 examples/s]
Filter -> gender:F&age:(40.0 - inf]: 100%|██████████| 226/226 [00:00<00:00, 16565.24 examples/s]
Filter -> gender:M&age:(-inf - 20.0]: 100%|██████████| 226/226 [00:00<00:00, 16303.69 examples/s]

</pre>

Filter -> gender:M&age:(-inf - 20.0]: 100%|██████████| 226/226 [00:00<00:00, 16303.69 examples/s]

end{sphinxVerbatim}

Filter -> gender:M&age:(-inf - 20.0]: 100%|██████████| 226/226 [00:00<00:00, 16303.69 examples/s]


Filter -&gt; gender:M&amp;age:(20.0 - 40.0]: 0%| | 0/226 [00:00&lt;?, ? examples/s]

</pre>

Filter -> gender:M&age:(20.0 - 40.0]: 0%| | 0/226 [00:00<?, ? examples/s]

end{sphinxVerbatim}

Filter -> gender:M&age:(20.0 - 40.0]: 0%| | 0/226 [00:00<?, ? examples/s]

Filter -&gt; gender:M&amp;age:(20.0 - 40.0]: 100%|██████████| 226/226 [00:00&lt;00:00, 16317.72 examples/s]

</pre>

Filter -> gender:M&age:(20.0 - 40.0]: 100%|██████████| 226/226 [00:00<00:00, 16317.72 examples/s]

end{sphinxVerbatim}

Filter -> gender:M&age:(20.0 - 40.0]: 100%|██████████| 226/226 [00:00<00:00, 16317.72 examples/s]


Filter -&gt; gender:M&amp;age:(40.0 - inf]: 0%| | 0/226 [00:00&lt;?, ? examples/s]

</pre>

Filter -> gender:M&age:(40.0 - inf]: 0%| | 0/226 [00:00<?, ? examples/s]

end{sphinxVerbatim}

Filter -> gender:M&age:(40.0 - inf]: 0%| | 0/226 [00:00<?, ? examples/s]

Filter -&gt; gender:M&amp;age:(40.0 - inf]: 100%|██████████| 226/226 [00:00&lt;00:00, 15568.65 examples/s]

</pre>

Filter -> gender:M&age:(40.0 - inf]: 100%|██████████| 226/226 [00:00<00:00, 15568.65 examples/s]

end{sphinxVerbatim}

Filter -> gender:M&age:(40.0 - inf]: 100%|██████████| 226/226 [00:00<00:00, 15568.65 examples/s]


Log the performance metrics to the report.

We can add a performance metric to the model card using the log_quantitative_analysis method. The flattened evaluation results are a dictionary whose keys follow the format slice_name/metric_name, for instance overall/accuracy.

We first need to process the evaluation results to get the metrics in the right format.

[35]:
model_name = f"model_for_preds.{model_name}"
results_flat = flatten_results_dict(
    results=results,
    remove_metrics=["BinaryROC", "BinaryPrecisionRecallCurve"],
    model_name=model_name,
)
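To make the slice_name/metric_name key format concrete, here is a minimal plain-Python sketch of the flattening idea. The helper `flatten_nested` and the values are hypothetical and only illustrate the shape of the data; the actual behaviour of `flatten_results_dict` may differ.

```python
# Hypothetical helper for illustration; not part of the CyclOps API.
def flatten_nested(results):
    """Flatten {slice: {metric: value}} into {"slice/metric": value}."""
    flat = {}
    for slice_name, slice_results in results.items():
        for metric_name, value in slice_results.items():
            flat[f"{slice_name}/{metric_name}"] = value
    return flat


# Made-up values, not results produced by this notebook.
nested = {
    "overall": {"BinaryAccuracy": 0.9, "BinaryF1Score": 0.8},
    "gender:F": {"BinaryAccuracy": 0.85},
}
flat = flatten_nested(nested)

# Each key splits back into its slice and metric parts.
for key, value in flat.items():
    slice_name, metric_name = key.split("/")
    print(slice_name, metric_name, value)
```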
[36]:
for name, metric in results_flat.items():
    split, name = name.split("/")  # noqa: PLW2901
    if name == "BinaryConfusionMatrix":
        continue
    descriptions = {
        "BinaryPrecision": "The proportion of predicted positive instances that are correctly predicted.",
        "BinaryRecall": "The proportion of actual positive instances that are correctly predicted. Also known as sensitivity or true positive rate.",
        "BinaryAccuracy": "The proportion of all instances that are correctly predicted.",
        "BinaryAUROC": "The area under the receiver operating characteristic curve (AUROC) is a measure of the performance of a binary classification model.",
        "BinaryF1Score": "The harmonic mean of precision and recall.",
    }
    report.log_quantitative_analysis(
        "performance",
        name=name,
        value=metric.tolist(),
        description=descriptions[name],
        metric_slice=split,
        pass_fail_thresholds=0.7,
        pass_fail_threshold_fns=lambda x, threshold: bool(x >= threshold),
    )
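The pass/fail test attached to each metric is a plain comparison against the threshold. The sketch below applies the same comparison as the lambda above to a few made-up metric values; the names `passes` and `example_metrics` are hypothetical and not part of CyclOps.

```python
def passes(value, threshold):
    """Same comparison as the lambda above: pass when value >= threshold."""
    return bool(value >= threshold)


# Made-up metric values for illustration only.
example_metrics = {
    "overall/BinaryAccuracy": 0.82,
    "overall/BinaryRecall": 0.64,
    "gender:F/BinaryF1Score": 0.70,
}
threshold = 0.7

outcomes = {
    name: passes(value, threshold) for name, value in example_metrics.items()
}
for name, passed in outcomes.items():
    print(f"{name}: {'pass' if passed else 'fail'}")
```

Note that a value exactly equal to the threshold passes, because the comparison is `>=`.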

We can also use the ClassificationPlotter to plot the performance metrics and then add the figure to the model card using the log_plotly_figure method.

[37]:
plotter = ClassificationPlotter(task_type="binary", class_names=["0", "1"])
plotter.set_template("plotly_white")
[38]:
# extracting the ROC curves and AUROC results for all the slices
roc_curves = {
    slice_name: slice_results["BinaryROC"]
    for slice_name, slice_results in results[model_name].items()
}
aurocs = {
    slice_name: slice_results["BinaryAUROC"]
    for slice_name, slice_results in results[model_name].items()
}
roc_curves.keys()
[38]:
dict_keys(['age:[20 - 50)', 'age:[50 - 80)', 'gender:M', 'gender:F', 'overall'])
[39]:
# Plot confusion matrix
confusion_matrix = results[model_name]["overall"]["BinaryConfusionMatrix"]
conf_plot = plotter.confusion_matrix(
    confusion_matrix,
)
report.log_plotly_figure(
    fig=conf_plot,
    caption="Confusion Matrix",
    section_name="quantitative analysis",
)
conf_plot.show()
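As a reminder of how the logged metrics relate to the confusion matrix, the sketch below derives precision, recall, accuracy and F1 from the four cell counts. The counts are made up, and the `[[TN, FP], [FN, TP]]` layout is an assumption (it is the scikit-learn convention); check the layout of the matrix returned by your metric before reusing this.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Compute basic binary classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f1,
    }


# Made-up counts, assumed to be laid out as [[TN, FP], [FN, TP]].
matrix = [[50, 10], [5, 35]]
(tn, fp), (fn, tp) = matrix
derived = metrics_from_counts(tn, fp, fn, tp)
for name, value in derived.items():
    print(f"{name}: {value:.3f}")
```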
[40]:
# plotting the ROC curves for all the slices
roc_plot = plotter.roc_curve_comparison(roc_curves, aurocs=aurocs)
report.log_plotly_figure(
    fig=roc_plot,
    caption="ROC Curves for All Slices",
    section_name="quantitative analysis",
)
roc_plot.show()
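For readers who want to see what an ROC curve and its AUROC summarise, here is a small pure-Python sketch that sweeps the decision threshold over the scores and integrates the curve with the trapezoidal rule. It is an illustration on toy data, not the CyclOps implementation, and it assumes both classes are present in `y_true`.

```python
def roc_points(y_true, y_score):
    """Return (fpr, tpr) lists obtained by lowering the threshold step by step."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Sort indices by descending score so each step lowers the threshold.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for rank, i in enumerate(order):
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last example of a group of tied scores.
        is_last = rank == len(order) - 1
        if is_last or y_score[order[rank + 1]] != y_score[i]:
            fpr.append(fp / negatives)
            tpr.append(tp / positives)
    return fpr, tpr


def trapezoid_area(x, y):
    """Area under the piecewise-linear curve through the points (x, y)."""
    return sum(
        (x[k] - x[k - 1]) * (y[k] + y[k - 1]) / 2 for k in range(1, len(x))
    )


# Toy labels and scores, not data from this notebook.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, y_score)
auroc = trapezoid_area(fpr, tpr)
print(fpr, tpr, auroc)
```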
[41]:
# Extracting the overall classification metric values.
overall_performance = {
    metric_name: metric_value
    for metric_name, metric_value in results[model_name]["overall"].items()
    if metric_name
    not in ["BinaryROC", "BinaryPrecisionRecallCurve", "BinaryConfusionMatrix"]
}
[42]:
# Plotting the overall classification metric values.
overall_performance_plot = plotter.metrics_value(
    overall_performance,
    title="Overall Performance",
)
report.log_plotly_figure(
    fig=overall_performance_plot,
    caption="Overall Performance",
    section_name="quantitative analysis",
)
overall_performance_plot.show()
[43]:
# Extracting the metric values for all the slices.
slice_metrics = {
    slice_name: {
        metric_name: metric_value
        for metric_name, metric_value in slice_results.items()
        if metric_name
        not in ["BinaryROC", "BinaryPrecisionRecallCurve", "BinaryConfusionMatrix"]
    }
    for slice_name, slice_results in results[model_name].items()
}
[44]:
# Plotting the metric values for all the slices.
slice_metrics_plot = plotter.metrics_comparison_bar(slice_metrics)
report.log_plotly_figure(
    fig=slice_metrics_plot,
    caption="Slice Metric Comparison",
    section_name="quantitative analysis",
)
slice_metrics_plot.show()
[45]:
# Reformatting the fairness metrics
fairness_results = copy.deepcopy(results["fairness"])
fairness_metrics = {}
# remove the group size from the fairness results and add it to the slice name
for slice_name, slice_results in fairness_results.items():
    group_size = slice_results.pop("Group Size")
    fairness_metrics[f"{slice_name} (Size={group_size})"] = slice_results
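The relabelling step can be tried in isolation on a small dictionary. The sketch below uses made-up values and assumes each slice maps to a dictionary containing a "Group Size" entry next to its metrics, as in the cell above; it also shows why the deep copy matters, since `pop` would otherwise mutate the original results.

```python
import copy

# Made-up fairness results shaped like the dictionary handled above.
example_results = {
    "gender:F": {"Group Size": 120, "BinaryAccuracy": 0.81},
    "gender:M": {"Group Size": 106, "BinaryAccuracy": 0.78},
}

# Deep-copy first so that pop() does not mutate the original results.
working = copy.deepcopy(example_results)
relabelled = {}
for slice_name, slice_results in working.items():
    group_size = slice_results.pop("Group Size")
    relabelled[f"{slice_name} (Size={group_size})"] = slice_results

print(relabelled)
```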
[46]:
# Plotting the fairness metrics
fairness_plot = plotter.metrics_comparison_scatter(
    fairness_metrics,
    title="Fairness Metrics",
)
report.log_plotly_figure(
    fig=fairness_plot,
    caption="Fairness Metrics",
    section_name="fairness analysis",
)
fairness_plot.show()

Report Generation#

Before generating the model card, let us document some of the details of the model and some considerations involved in developing and using the model.

Let’s start with populating the model details section, which includes the following fields by default:

- description: A high-level description of the model and its usage for a general audience.
- version: The version of the model.
- owners: The individuals or organizations that own the model.
- license: The license under which the model is made available.
- citation: The citation for the model.
- references: Links to resources that are relevant to the model.
- path: The path to where the model is stored.
- regulatory_requirements: The regulatory requirements that are relevant to the model.

We can add additional fields to the model details section by passing a dictionary to the log_from_dict method and specifying the section name as model_details. You can also use the log_descriptor method to add a new field object with a description attribute to any section of the model card.

[47]:
report.log_from_dict(
    data={
        "name": "Prolonged Length of Stay Prediction Model",
        # Adjacent string literals avoid embedding indentation in the text.
        "description": (
            "The model was trained on the Synthea synthetic dataset "
            "to predict prolonged stay in the hospital."
        ),
    },
    section_name="model_details",
)
report.log_version(
    version_str="0.0.1",
    date=str(date.today()),
    description="Initial Release",
)
report.log_owner(
    name="CyclOps Team",
    contact="vectorinstitute.github.io/cyclops/",
    email="cyclops@vectorinstitute.ai",
)
report.log_license(identifier="Apache-2.0")
report.log_reference(
    link="https://xgboost.readthedocs.io/en/stable/python/python_api.html",  # noqa: E501
)

Next, let’s populate the considerations section, which includes the following fields by default:

- users: The intended users of the model.
- use_cases: The use cases for the model. These could be primary, downstream or out-of-scope use cases.
- fairness_assessment: A description of the benefits and harms of the model for different groups as well as the steps taken to mitigate the harms.
- ethical_considerations: The risks associated with using the model and the steps taken to mitigate them. This can be populated using the log_risk method.

[48]:
report.log_from_dict(
    data={
        "users": [
            {"description": "Hospitals"},
            {"description": "Clinicians"},
        ],
    },
    section_name="considerations",
)
report.log_user(description="ML Engineers")
report.log_use_case(
    description="Predicting prolonged length of stay",
    kind="primary",
)
report.log_fairness_assessment(
    affected_group="sex, age",
    benefit="Improved health outcomes for patients.",
    harm=(
        "Biased predictions for patients in certain groups (e.g. older patients) "
        "may lead to worse health outcomes."
    ),
    mitigation_strategy=(
        "We will monitor the performance of the model on these groups "
        "and retrain the model if the performance drops below a certain threshold."
    ),
)
report.log_risk(
    risk="The model may be used to make decisions that affect the health of patients.",
    mitigation_strategy=(
        "The model should be continuously monitored for performance "
        "and retrained if the performance drops below a certain threshold."
    ),
)

Once the model card is populated, you can generate the report using the export method. The report is generated as an HTML file, and a JSON file containing the model card data is generated alongside it. By default, the files are saved in a folder named cyclops_report in the current working directory. You can change the path by passing an output_dir argument when instantiating the ModelCardReport class.

[49]:
synthetic_timestamps = [
    "2021-09-01",
    "2021-10-01",
    "2021-11-01",
    "2021-12-01",
    "2022-01-01",
]
report._model_card.overview = None
report_path = report.export(
    output_filename="length_of_stay_report_periodic.html",
    synthetic_timestamp=synthetic_timestamps[0],
)
shutil.copy(f"{report_path}", ".")
for i in range(4):
    report._model_card.overview = None
    for metric in report._model_card.quantitative_analysis.performance_metrics:
        metric.value = np.clip(
            metric.value + np.random.normal(0, 0.1),
            0,
            1,
        )
        metric.tests[0].passed = bool(metric.value >= 0.7)
    report_path = report.export(
        output_filename="length_of_stay_report_periodic.html",
        synthetic_timestamp=synthetic_timestamps[i + 1],
    )
    shutil.copy(f"{report_path}", ".")
shutil.rmtree("./cyclops_report")
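The loop above simulates periodic evaluations by adding Gaussian noise to each metric, clipping the result to [0, 1], and re-running the 0.7 threshold test. The same idea can be sketched without CyclOps or NumPy; the helper `perturb` and the starting value are hypothetical, and a seeded generator is used here only to make the run repeatable.

```python
import random


def perturb(value, rng, sigma=0.1):
    """Add zero-mean Gaussian noise and clip the result to [0, 1]."""
    noisy = value + rng.gauss(0, sigma)
    return min(max(noisy, 0.0), 1.0)


rng = random.Random(0)  # seeded so repeated runs give the same sequence
value = 0.75  # made-up starting metric value
history = []
for _ in range(4):
    value = perturb(value, rng)
    passed = value >= 0.7  # same threshold test as in the cell above
    history.append((value, passed))

for step, (metric_value, passed) in enumerate(history, start=1):
    print(f"period {step}: value={metric_value:.3f} passed={passed}")
```

Because each perturbation is applied to the previous period's value, the simulated metric follows a bounded random walk over the periods.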

You can view the generated HTML report.