A library for developing foundation models using Electronic Health Records (EHR) data.
Read our recent EHRMamba paper.
Odyssey is a comprehensive library designed to facilitate the development, training, and deployment of foundation models for Electronic Health Records (EHR). Recently, we used this toolkit to develop EHRMamba, a cutting-edge EHR foundation model that leverages the Mamba architecture and Multitask Prompted Finetuning (MPF) to overcome the limitations of existing transformer-based models. EHRMamba excels in processing long temporal sequences, simultaneously learning multiple clinical tasks, and performing EHR forecasting, significantly advancing the state of the art in EHR modeling.
The toolkit is structured into four main modules to streamline the development process.
The data extraction and preprocessing pipeline is run from the MEDS repository. It extracts and preprocesses the MIMIC-IV dataset to generate each patient's sequence of events.
Clone and install the required repository locally:
```bash
git clone --branch odyssey https://github.com/VectorInstitute/meds.git
cd meds/MIMIC-IV_Example
pip install .
```
As mentioned in the MEDS repository, two (optional) Hydra multirun job launchers are available for parallelizing the extraction and preprocessing pipeline steps: `joblib` (for local parallelism) and `submitit` (to launch jobs with Slurm for cluster parallelism). To use either of these, install the additional optional dependencies: `pip install -e .[local_parallelism]` for `joblib` local parallelism support, or `pip install -e .[slurm_parallelism]` for `submitit` cluster parallelism support.

The `run_extract.sh` script performs the following steps:
1. Runs the `pre_MEDS` step.
2. Runs the `extract` pipeline, which builds the MEDS cohort from the pre-MEDS output.
Note that the events extracted and included in the MEDS cohort are defined by the event config files.
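For orientation, here is a minimal sketch of the general shape such an event config file can take: each raw table maps to named events with a code and a timestamp column. All table, event, and column names below are illustrative assumptions; the authoritative schema is the one documented in the MEDS repository.

```yaml
# Illustrative sketch only; not the actual Odyssey/MEDS event config.
# Each raw MIMIC-IV table maps to named events with a code and a timestamp column.
subject_id_col: subject_id        # column identifying the patient (assumed name)
hosp/admissions:                  # raw source table (assumed name)
  admission:                      # event name (assumed)
    code: HOSPITAL_ADMISSION      # code assigned to the event
    time: col(admittime)          # timestamp column (assumed)
```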
Run the extract pipeline using:
```bash
./run_extract.sh path_to_raw_data_dir path_to_preMEDS_dir path_to_MEDS_dir
```
The script accepts the following optional arguments (combined usage is sketched after this list):

- `do_unzip=true|false`: (Optional) Unzip CSV files before processing (default: `false`).
- `batch_files`: Run `batch_files.py` before processing (requires extra arguments):
  - `--lab_input=<path>`: (Required if `batch_files` is set) Path to the `labevents` CSV.
  - `--chart_input=<path>`: (Required if `batch_files` is set) Path to the `chartevents` CSV.
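A hedged sketch of combining these options in a single invocation follows. The `batch_files=true` syntax is an assumption (mirroring the `key=value` style of `do_unzip`); verify the exact flags against the MEDS repository.

```bash
# Illustrative sketch; batch_files syntax is an assumption mirroring do_unzip.
./run_extract.sh path_to_raw_data_dir path_to_preMEDS_dir path_to_MEDS_dir \
    do_unzip=true \
    batch_files=true \
    --lab_input=path/to/labevents.csv \
    --chart_input=path/to/chartevents.csv
```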
To use a specific stage runner file (e.g., to set different parallelism options), specify it as an additional argument:
```bash
export N_WORKERS=5
./run_extract.sh path_to_raw_data_dir path_to_preMEDS_dir path_to_Extract_dir \
    stage_runner_fp=slurm_runner.yaml
```
The `N_WORKERS` environment variable, set before the command, controls the maximum number of parallel workers.
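For reference, a hedged sketch of what a `slurm_runner.yaml` might contain: per-stage parallelization settings such as a worker count and a launcher. Every stage and field name below is an assumption; consult the MEDS repository for the actual schema.

```yaml
# Illustrative sketch only; stage and field names are assumptions.
shard_events:
  parallelize:
    n_workers: 5              # upper bound on parallel workers for this stage
    launcher: submitit_slurm  # Hydra launcher used to submit Slurm jobs
```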
The `run_preprocess` script executes the following steps:

1. A filtering step with a configurable minimum threshold (`0` in our case).
2. Time binning of event timestamps (`hour2bin` and `minute2bin`).

Run the preprocess pipeline using:
```bash
./run_preprocess.sh path_to_Extract_dir path_to_Processed_DIR
```
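If `run_preprocess.sh` honors the same `N_WORKERS` convention as `run_extract.sh` (an assumption; the text above documents it only for extraction), preprocessing can be parallelized the same way:

```bash
# Assumption: run_preprocess.sh respects N_WORKERS like run_extract.sh.
export N_WORKERS=5
./run_preprocess.sh path_to_Extract_dir path_to_Processed_DIR
```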
To customize the default parameters for each pipeline step, modify the following configuration files:
- `extract_MIMIC_seq.yaml`
- `preprocess_MIMIC_seq.yaml`
We welcome contributions from the community! Please open an issue.
If you use EHRMamba or Odyssey in your research, please cite our paper:
```bibtex
@misc{fallahpour2024ehrmamba,
      title={EHRMamba: Towards Generalizable and Scalable Foundation Models for Electronic Health Records},
      author={Adibvafa Fallahpour and Mahshid Alinoori and Arash Afkanpour and Amrit Krishnan},
      year={2024},
      eprint={2405.14567},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```