HALO Implementation into PyHealth + Colab Notebook #812
Draft
jalengg wants to merge 35 commits into sunlabuiuc:master from
- Add HALO (Healthcare generative model using transformers) implementation
- Include example training script with configurable parameters
- Include example generation script for synthetic patient data
- Add canonical SLURM scripts with optimal parameters (80 epochs, batch_size 48, lr 0.0001)
- Register HALO in generators module
- Update HALO_MIMIC3Dataset with latest preprocessing
- Update README with HALO documentation
Remove README.rst changes that only documented CorGAN, not HALO. This PR should focus solely on HALO implementation.
…ls to HALO notebook
Complete Tasks 3-7:
- Configuration panel with demo defaults
- Data upload with validation
- Training logic with checkpoint management
- Generation with CSV conversion
- Results display with quality checks and download
Notebook now has 24 cells with a complete end-to-end workflow.
- Replace `!pip install` with subprocess.run() for error checking
- Show clear error message if installation fails
- Raise RuntimeError to stop notebook execution on failure
Fixes #1
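A minimal sketch of the pattern this commit describes, not the notebook's actual cell: `run_checked` and `pip_install` are hypothetical helper names, and the exact message wording is an assumption. It shows why `subprocess.run()` helps here: the exit code can be inspected and turned into a `RuntimeError`, so a failed install halts the notebook instead of passing silently.

```python
import subprocess
import sys


def run_checked(cmd):
    """Run a command; raise RuntimeError (with stderr shown) if it fails."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the underlying error before stopping the notebook.
        print(result.stderr)
        raise RuntimeError(
            f"Command failed with exit code {result.returncode}: {' '.join(cmd)}"
        )
    return result.stdout


def pip_install(spec):
    """Install `spec` using the same interpreter that runs the notebook."""
    return run_checked([sys.executable, "-m", "pip", "install", spec])
```

Calling pip through `sys.executable -m pip` keeps the install tied to the kernel's own environment, which a bare `pip` on the PATH does not guarantee.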
jalengg
commented
Feb 16, 2026
| import os | ||
| import sys | ||
| sys.path.insert(0, '/u/jalenj4/PyHealth') |
- Remove PATIENTS.csv and patient_ids.txt (not used by HALO_MIMIC3Dataset)
- Handle Colab file renaming (ADMISSIONS (1).csv -> ADMISSIONS.csv)
- Allow uploading files one at a time with progress tracking
- Check Google Drive for existing files before requesting upload
- Add FORK variable to installation cell for easier testing
Fixes #4, #5, #6
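The renaming and progress-tracking items above could be handled roughly as follows. This is an illustrative sketch only: `canonical_name` and `missing_files` are hypothetical names, and the regular expression assumes Colab appends a ` (N)` suffix before the extension, as in the `ADMISSIONS (1).csv` example from the commit message. The required file names are the two listed in the notebook header.

```python
import re

# Matches names such as "ADMISSIONS (1).csv" (assumed Colab duplicate suffix).
_DUP_SUFFIX = re.compile(r"^(?P<stem>.+?) \(\d+\)(?P<ext>\.[^.]+)$")

REQUIRED_FILES = ("ADMISSIONS.csv", "DIAGNOSES_ICD.csv")


def canonical_name(filename):
    """Strip a ' (N)' duplicate suffix, leaving other names unchanged."""
    match = _DUP_SUFFIX.match(filename)
    return f"{match['stem']}{match['ext']}" if match else filename


def missing_files(uploaded, required=REQUIRED_FILES):
    """Return the required files not yet covered by the uploaded names."""
    have = {canonical_name(name) for name in uploaded}
    return [name for name in required if name not in have]
```

Because `missing_files` can be re-run after each upload, it also fits the one-file-at-a-time flow: the list shrinks until it is empty.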
7 tasks
Ensures Colab users always get the latest version from GitHub without using cached packages. Critical for picking up recent fixes like the halo_resources __init__.py.
Fixes #18
Use os.path.join() instead of string concatenation to properly handle directory paths with or without trailing slashes.
Fixes #19
Fixes #21
The YAML config files in pyhealth/datasets/configs/ were not being included when the package was installed via pip. This caused FileNotFoundError for multiple datasets including HALO, MIMIC3, MIMIC4, EHRShot, COVID-19 CXR, and Medical Transcriptions. Added MANIFEST.in to specify which non-Python files should be included in the package distribution.
Fixes #21
MANIFEST.in only affects sdist source distributions. When installing via `pip install git+https://...` (as in Colab), pip relies on package_data in setup.py to include non-Python files. Added explicit package_data to ensure YAML configs in pyhealth/datasets/configs/ are included in all install paths. Removed MANIFEST.in as it provided no benefit for pip-from-git installs.
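One way to confirm that non-Python files actually ship with an installed package is to look them up through `importlib.resources` (Python 3.9+). The helper below is a hypothetical check, not part of this PR, and the config file name in the usage comment is an assumption made for illustration.

```python
from importlib import resources


def missing_package_files(package, relative_paths):
    """Return the relative paths that are absent from an installed package."""
    root = resources.files(package)
    missing = []
    for rel in relative_paths:
        node = root
        # Walk one path component at a time so nested paths resolve portably.
        for part in rel.split("/"):
            node = node.joinpath(part)
        if not node.is_file():
            missing.append(rel)
    return missing


# Hypothetical usage after `pip install git+https://...`:
#   missing_package_files("pyhealth", ["datasets/configs/mimic3.yaml"])
# An empty list would mean the YAML config was included in the install.
```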
Timestamp reflects when notebook was last modified so users can verify they are running the correct version. Reverts dynamic install-time timestamp in favor of this static header approach.
When pkl_data_dir has no trailing slash (e.g. "/path/to/pkl_data"), raw f-string concatenation produced invalid paths like "/path/to/pkl_datacodeToIndex.pkl" instead of "/path/to/pkl_data/codeToIndex.pkl". Replace all pickle save paths with os.path.join(). Also add os.makedirs() so the output directory is created if missing.
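The difference is easy to reproduce. In this sketch `pickle_path` is a hypothetical helper (the PR's own function names are not shown here); it combines the two fixes the commit describes, creating the directory and joining with `os.path.join()`.

```python
import os


def pickle_path(pkl_data_dir, name):
    """Build a pickle path that works with or without a trailing slash."""
    # Create the output directory if it does not exist yet.
    os.makedirs(pkl_data_dir, exist_ok=True)
    return os.path.join(pkl_data_dir, name)


# The bug: plain concatenation silently drops the separator.
broken = f"{'/path/to/pkl_data'}{'codeToIndex.pkl'}"
# broken == "/path/to/pkl_datacodeToIndex.pkl"
```

`os.path.join()` inserts a separator only when one is missing, so `"/path/to/pkl_data"` and `"/path/to/pkl_data/"` resolve to the same file.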
jalengg
commented
Feb 18, 2026
examples/halo_mimic3_colab.ipynb
Outdated
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": "# HALO Synthetic Data Generation for MIMIC-III\n\n_Last updated: 2026-02-17 02:54:44_\n\nThis notebook trains the HALO (Hierarchical Autoregressive Language mOdel) on your MIMIC-III data and generates synthetic patients.\n\n## What You'll Need\n\n1. **MIMIC-III Access**: Download these files from PhysioNet:\n - `ADMISSIONS.csv`\n - `DIAGNOSES_ICD.csv`\n\n2. **Google Colab**: Free tier works, but GPU recommended (Runtime \u2192 Change runtime type \u2192 GPU)\n\n3. **Time**:\n - Demo (5 epochs, 1K samples): ~20-30 min on GPU\n - Production (80 epochs, 10K samples): ~6-10 hours on GPU\n\n## How It Works\n\n1. **Setup**: Install PyHealth and mount Google Drive\n2. **Upload Data**: Upload your MIMIC-III CSV files\n3. **Configure**: Set hyperparameters (epochs, batch size, etc.)\n4. **Train**: Train HALO model (checkpoints saved to Drive)\n5. **Generate**: Create synthetic patients using trained model\n6. **Download**: Get CSV file with synthetic data\n\n## Important Notes\n\n\u26a0\ufe0f **Colab Timeout**: Free Colab sessions timeout after 12 hours. For production training (80 epochs), consider:\n- Colab Pro for longer sessions\n- Running on your own GPU cluster using `examples/slurm/train_halo_mimic3.slurm`\n\n\ud83d\udcca **Demo vs Production**:\n- Demo defaults (5 epochs, 1K samples) let you test the pipeline quickly\n- Production settings (80 epochs, 10K samples) match the published HALO results\n\n## References\n\n- [HALO Paper](https://arxiv.org/abs/2406.16061)\n- [PyHealth Documentation](https://pyhealth.readthedocs.io/)\n- [MIMIC-III Access](https://physionet.org/content/mimiciii/)" | ||
| "source": "# HALO Synthetic Data Generation for MIMIC-III\n\n_Last updated: 2026-02-17 03:38:26_\n\nThis notebook trains the HALO (Hierarchical Autoregressive Language mOdel) on your MIMIC-III data and generates synthetic patients.\n\n## What You'll Need\n\n1. **MIMIC-III Access**: Download these files from PhysioNet:\n - `ADMISSIONS.csv`\n - `DIAGNOSES_ICD.csv`\n\n2. **Google Colab**: Free tier works, but GPU recommended (Runtime \u2192 Change runtime type \u2192 GPU)\n\n3. **Time**:\n - Demo (5 epochs, 1K samples): ~20-30 min on GPU\n - Production (80 epochs, 10K samples): ~6-10 hours on GPU\n\n## How It Works\n\n1. **Setup**: Install PyHealth and mount Google Drive\n2. **Upload Data**: Upload your MIMIC-III CSV files\n3. **Configure**: Set hyperparameters (epochs, batch size, etc.)\n4. **Train**: Train HALO model (checkpoints saved to Drive)\n5. **Generate**: Create synthetic patients using trained model\n6. **Download**: Get CSV file with synthetic data\n\n## Important Notes\n\n\u26a0\ufe0f **Colab Timeout**: Free Colab sessions timeout after 12 hours. For production training (80 epochs), consider:\n- Colab Pro for longer sessions\n- Running on your own GPU cluster using `examples/slurm/train_halo_mimic3.slurm`\n\n\ud83d\udcca **Demo vs Production**:\n- Demo defaults (5 epochs, 1K samples) let you test the pipeline quickly\n- Production settings (80 epochs, 10K samples) match the published HALO results\n\n## References\n\n- [HALO Paper](https://arxiv.org/abs/2406.16061)\n- [PyHealth Documentation](https://pyhealth.readthedocs.io/)\n- [MIMIC-III Access](https://physionet.org/content/mimiciii/)" |
Author
@shiitavie This is an example of how to commit changes to the notebook.
Make sure you manually update the `Last updated` timestamp before committing.
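If the manual step proves easy to forget, it could be scripted. The following is only a sketch of that idea, not something this PR does: `restamp` is a hypothetical helper that rewrites the `_Last updated: ..._` line in a markdown cell's source string, assuming the `YYYY-MM-DD HH:MM:SS` format seen in the notebook header.

```python
import re
from datetime import datetime

# Matches the header line format used in the notebook's first markdown cell.
_STAMP = re.compile(r"_Last updated: \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}_")


def restamp(markdown_source, when):
    """Replace the first 'Last updated' line with the given datetime."""
    new_line = f"_Last updated: {when:%Y-%m-%d %H:%M:%S}_"
    updated, count = _STAMP.subn(new_line, markdown_source, count=1)
    if count == 0:
        # Fail loudly rather than commit a notebook with no timestamp.
        raise ValueError("no 'Last updated' line found")
    return updated
```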
jalengg
commented
Feb 18, 2026
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": "# HALO Synthetic Data Generation for MIMIC-III\n\n_Last updated: 2026-02-17 03:38:26_\n\nThis notebook trains the HALO (Hierarchical Autoregressive Language mOdel) on your MIMIC-III data and generates synthetic patients.\n\n## What You'll Need\n\n1. **MIMIC-III Access**: Download these files from PhysioNet:\n - `ADMISSIONS.csv`\n - `DIAGNOSES_ICD.csv`\n\n2. **Google Colab**: Free tier works, but GPU recommended (Runtime \u2192 Change runtime type \u2192 GPU)\n\n3. **Time**:\n - Demo (5 epochs, 1K samples): ~20-30 min on GPU\n - Production (80 epochs, 10K samples): ~6-10 hours on GPU\n\n## How It Works\n\n1. **Setup**: Install PyHealth and mount Google Drive\n2. **Upload Data**: Upload your MIMIC-III CSV files\n3. **Configure**: Set hyperparameters (epochs, batch size, etc.)\n4. **Train**: Train HALO model (checkpoints saved to Drive)\n5. **Generate**: Create synthetic patients using trained model\n6. **Download**: Get CSV file with synthetic data\n\n## Important Notes\n\n\u26a0\ufe0f **Colab Timeout**: Free Colab sessions timeout after 12 hours. For production training (80 epochs), consider:\n- Colab Pro for longer sessions\n- Running on your own GPU cluster using `examples/slurm/train_halo_mimic3.slurm`\n\n\ud83d\udcca **Demo vs Production**:\n- Demo defaults (5 epochs, 1K samples) let you test the pipeline quickly\n- Production settings (80 epochs, 10K samples) match the published HALO results\n\n## References\n\n- [HALO Paper](https://arxiv.org/abs/2406.16061)\n- [PyHealth Documentation](https://pyhealth.readthedocs.io/)\n- [MIMIC-III Access](https://physionet.org/content/mimiciii/)" | ||
| "source": "# HALO Synthetic Data Generation for MIMIC-III\n\n_Last updated: 2026-02-17 03:38:26_\n\nThis notebook trains the HALO (Hierarchical Autoregressive Language mOdel) on your MIMIC-III data and generates synthetic patients.\n\n## What You'll Need\n\n1. **MIMIC-III Access**: Download these files from PhysioNet:\n - `ADMISSIONS.csv`\n - `DIAGNOSES_ICD.csv`\n\n2. **Google Colab**: Free tier works, but GPU recommended (Runtime → Change runtime type → GPU)\n\n3. **Time**:\n - Demo (5 epochs, 1K samples): ~20-30 min on GPU\n - Production (80 epochs, 10K samples): ~6-10 hours on GPU\n\n## How It Works\n\n1. **Setup**: Install PyHealth and mount Google Drive\n2. **Upload Data**: Upload your MIMIC-III CSV files\n3. **Configure**: Set hyperparameters (epochs, batch size, etc.)\n4. **Train**: Train HALO model (checkpoints saved to Drive)\n5. **Generate**: Create synthetic patients using trained model\n6. **Download**: Get CSV file with synthetic data\n\n## Important Notes\n\n⚠️ **Colab Timeout**: Free Colab sessions timeout after 12 hours. For production training (80 epochs), consider:\n- Colab Pro for longer sessions\n- Running on your own GPU cluster using `examples/slurm/train_halo_mimic3.slurm`\n\n📊 **Demo vs Production**:\n- Demo defaults (5 epochs, 1K samples) let you test the pipeline quickly\n- Production settings (80 epochs, 10K samples) match the published HALO results\n\n## References\n\n- [HALO Paper](https://arxiv.org/abs/2406.16061)\n- [PyHealth Documentation](https://pyhealth.readthedocs.io/)\n- [MIMIC-III Access](https://physionet.org/content/mimiciii/)" |
Author
Suggested change
| "source": "# HALO Synthetic Data Generation for MIMIC-III\n\n_Last updated: 2026-02-17 03:38:26_\n\nThis notebook trains the HALO (Hierarchical Autoregressive Language mOdel) on your MIMIC-III data and generates synthetic patients.\n\n## What You'll Need\n\n1. **MIMIC-III Access**: Download these files from PhysioNet:\n - `ADMISSIONS.csv`\n - `DIAGNOSES_ICD.csv`\n\n2. **Google Colab**: Free tier works, but GPU recommended (Runtime → Change runtime type → GPU)\n\n3. **Time**:\n - Demo (5 epochs, 1K samples): ~20-30 min on GPU\n - Production (80 epochs, 10K samples): ~6-10 hours on GPU\n\n## How It Works\n\n1. **Setup**: Install PyHealth and mount Google Drive\n2. **Upload Data**: Upload your MIMIC-III CSV files\n3. **Configure**: Set hyperparameters (epochs, batch size, etc.)\n4. **Train**: Train HALO model (checkpoints saved to Drive)\n5. **Generate**: Create synthetic patients using trained model\n6. **Download**: Get CSV file with synthetic data\n\n## Important Notes\n\n⚠️ **Colab Timeout**: Free Colab sessions timeout after 12 hours. For production training (80 epochs), consider:\n- Colab Pro for longer sessions\n- Running on your own GPU cluster using `examples/slurm/train_halo_mimic3.slurm`\n\n📊 **Demo vs Production**:\n- Demo defaults (5 epochs, 1K samples) let you test the pipeline quickly\n- Production settings (80 epochs, 10K samples) match the published HALO results\n\n## References\n\n- [HALO Paper](https://arxiv.org/abs/2406.16061)\n- [PyHealth Documentation](https://pyhealth.readthedocs.io/)\n- [MIMIC-III Access](https://physionet.org/content/mimiciii/)" | |
| "source": "# HALO Synthetic Data Generation for MIMIC-III\n\n_Last updated: 2026-02-18 02:30:00_\n\nThis notebook trains the HALO (Hierarchical Autoregressive Language mOdel) on your MIMIC-III data and generates synthetic patients.\n\n## What You'll Need\n\n1. **MIMIC-III Access**: Download these files from PhysioNet:\n - `ADMISSIONS.csv`\n - `DIAGNOSES_ICD.csv`\n\n2. **Google Colab**: Free tier works, but GPU recommended (Runtime → Change runtime type → GPU)\n\n3. **Time**:\n - Demo (5 epochs, 1K samples): ~20-30 min on GPU\n - Production (80 epochs, 10K samples): ~6-10 hours on GPU\n\n## How It Works\n\n1. **Setup**: Install PyHealth and mount Google Drive\n2. **Upload Data**: Upload your MIMIC-III CSV files\n3. **Configure**: Set hyperparameters (epochs, batch size, etc.)\n4. **Train**: Train HALO model (checkpoints saved to Drive)\n5. **Generate**: Create synthetic patients using trained model\n6. **Download**: Get CSV file with synthetic data\n\n## Important Notes\n\n⚠️ **Colab Timeout**: Free Colab sessions timeout after 12 hours. For production training (80 epochs), consider:\n- Colab Pro for longer sessions\n- Running on your own GPU cluster using `examples/slurm/train_halo_mimic3.slurm`\n\n📊 **Demo vs Production**:\n- Demo defaults (5 epochs, 1K samples) let you test the pipeline quickly\n- Production settings (80 epochs, 10K samples) match the published HALO results\n\n## References\n\n- [HALO Paper](https://arxiv.org/abs/2406.16061)\n- [PyHealth Documentation](https://pyhealth.readthedocs.io/)\n- [MIMIC-III Access](https://physionet.org/content/mimiciii/)" |
No description provided.