HALO Implementation into PyHealth + Colab Notebook #812

Draft
jalengg wants to merge 35 commits into sunlabuiuc:master from jalengg:halo-pr-528

@jalengg jalengg commented Feb 4, 2026

No description provided.

chufangao and others added 11 commits June 15, 2025 13:04
- Add HALO (Hierarchical Autoregressive Language mOdel) implementation
- Include example training script with configurable parameters
- Include example generation script for synthetic patient data
- Add canonical SLURM scripts with optimal parameters (80 epochs, batch_size 48, lr 0.0001)
- Register HALO in generators module
- Update HALO_MIMIC3Dataset with latest preprocessing
- Update README with HALO documentation
@jalengg jalengg marked this pull request as ready for review February 14, 2026 23:43
@jalengg jalengg marked this pull request as draft February 14, 2026 23:43
Remove README.rst changes that only documented CorGAN, not HALO.
This PR should focus solely on HALO implementation.
…ls to HALO notebook

Complete Tasks 3-7:
- Configuration panel with demo defaults
- Data upload with validation
- Training logic with checkpoint management
- Generation with CSV conversion
- Results display with quality checks and download

Notebook now has 24 cells with complete end-to-end workflow.
- Replace `!pip install` with subprocess.run() for error checking
- Show clear error message if installation fails
- Raise RuntimeError to stop notebook execution on failure

Fixes #1
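
The commit above describes replacing `!pip install` with `subprocess.run()` so that a failed install stops the notebook. A minimal sketch of that pattern follows; the helper names `run_checked` and `pip_install` are hypothetical and are not taken from the notebook itself.

```python
import subprocess
import sys


def run_checked(cmd):
    """Run a command; raise RuntimeError with the captured stderr if it fails."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"Command failed (exit code {result.returncode}): {' '.join(cmd)}\n"
            f"{result.stderr.strip()}"
        )
    return result.stdout


def pip_install(package_spec):
    # `sys.executable -m pip` targets the interpreter the notebook kernel runs on.
    return run_checked([sys.executable, "-m", "pip", "install", package_spec])
```

Because the exception propagates out of the cell, "Run all" halts at the installation step instead of continuing with a broken environment.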

import os
import sys
sys.path.insert(0, '/u/jalenj4/PyHealth')
jalengg (Author) commented:

remove

- Remove PATIENTS.csv and patient_ids.txt (not used by HALO_MIMIC3Dataset)
- Handle Colab file renaming (ADMISSIONS (1).csv -> ADMISSIONS.csv)
- Allow uploading files one at a time with progress tracking
- Check Google Drive for existing files before requesting upload
- Add FORK variable to installation cell for easier testing

Fixes #4, #5, #6
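
One way to handle the `ADMISSIONS (1).csv -> ADMISSIONS.csv` renaming mentioned above is to strip the numeric duplicate suffix before checking which files are present. This is only a sketch: the function names are hypothetical, and the required-file list is assumed from the two CSVs the notebook header names.

```python
import re

# Matches a " (N)" duplicate suffix that sits directly before the file extension.
_DUPLICATE_SUFFIX = re.compile(r" \(\d+\)(?=\.[^.]+$)")

# Assumed from the notebook's "What You'll Need" section.
REQUIRED_FILES = ("ADMISSIONS.csv", "DIAGNOSES_ICD.csv")


def canonical_name(uploaded_name):
    """Return the file name with any ' (N)' duplicate suffix removed."""
    return _DUPLICATE_SUFFIX.sub("", uploaded_name)


def missing_files(uploaded_names):
    """List the required files not yet covered by the uploads so far."""
    present = {canonical_name(name) for name in uploaded_names}
    return [name for name in REQUIRED_FILES if name not in present]
```

Calling `missing_files` after each upload would also support the one-file-at-a-time flow with progress tracking.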
Ensures Colab users always get the latest version from GitHub without
using cached packages. Critical for picking up recent fixes like the
halo_resources __init__.py.

Fixes #18
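
The commit does not say which pip options it uses. As an assumption, the sketch below builds an install command with pip's `--no-cache-dir` and `--force-reinstall` flags, which together avoid reusing a cached or already-installed copy. The `repo_url` argument is a placeholder for whatever fork the notebook's `FORK` variable points at.

```python
import sys


def fresh_install_command(repo_url):
    """Build a pip command that reinstalls from Git without using pip's cache."""
    return [
        sys.executable, "-m", "pip", "install",
        "--no-cache-dir",     # do not reuse cached wheels or downloads
        "--force-reinstall",  # reinstall even if a version is already present
        f"git+{repo_url}",
    ]
```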
Use os.path.join() instead of string concatenation to properly handle
directory paths with or without trailing slashes.

Fixes #19
Fixes #21

The YAML config files in pyhealth/datasets/configs/ were not being
included when the package was installed via pip. This caused
FileNotFoundError for multiple datasets including HALO, MIMIC3,
MIMIC4, EHRShot, COVID-19 CXR, and Medical Transcriptions.

Added MANIFEST.in to specify which non-Python files should be
included in the package distribution.
Fixes #21

MANIFEST.in only affects sdist source distributions. When installing
via `pip install git+https://...` (as in Colab), pip relies on
package_data in setup.py to include non-Python files.

Added explicit package_data to ensure YAML configs in
pyhealth/datasets/configs/ are included in all install paths.
Removed MANIFEST.in as it provided no benefit for pip-from-git installs.
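
For illustration, a `package_data` mapping for the YAML configs might look like the sketch below. The exact glob pattern is an assumption, since the actual `setup.py` change is not shown here, and `matched_data_files` is a hypothetical helper for checking what the patterns select in a source tree.

```python
import glob
import os

# Assumed mapping for setup.py: package name -> glob patterns relative to that package.
PACKAGE_DATA = {"pyhealth": ["datasets/configs/*.yaml"]}


def matched_data_files(source_root, package_data=PACKAGE_DATA):
    """List the files under source_root that the package_data patterns select."""
    matches = []
    for package, patterns in package_data.items():
        package_dir = os.path.join(source_root, *package.split("."))
        for pattern in patterns:
            matches.extend(glob.glob(os.path.join(package_dir, pattern)))
    return sorted(matches)
```

In `setup.py` the mapping would be passed as `setup(..., package_data=PACKAGE_DATA)`. Running `matched_data_files(".")` from the repository root is a quick way to confirm the pattern picks up the expected configs before publishing.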
jalengg and others added 6 commits February 17, 2026 02:56
Timestamp reflects when notebook was last modified so users can
verify they are running the correct version. Reverts dynamic
install-time timestamp in favor of this static header approach.
When pkl_data_dir has no trailing slash (e.g. "/path/to/pkl_data"),
raw f-string concatenation produced invalid paths like
"/path/to/pkl_datacodeToIndex.pkl" instead of
"/path/to/pkl_data/codeToIndex.pkl".

Replace all pickle save paths with os.path.join(). Also add
os.makedirs() so the output directory is created if missing.
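
The fix described above can be sketched as follows. `save_pickle` is a hypothetical helper rather than the function in the PR; it shows `os.path.join()` producing the same path whether or not `pkl_data_dir` ends in a slash, with `os.makedirs()` creating the directory first.

```python
import os
import pickle


def save_pickle(obj, pkl_data_dir, filename):
    """Pickle obj to pkl_data_dir/filename, creating the directory if needed."""
    os.makedirs(pkl_data_dir, exist_ok=True)
    # os.path.join inserts the separator only when it is missing.
    path = os.path.join(pkl_data_dir, filename)
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return path
```

By contrast, `f"{pkl_data_dir}codeToIndex.pkl"` is only correct when the caller remembers the trailing slash.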
"cell_type": "markdown",
"metadata": {},
"source": "# HALO Synthetic Data Generation for MIMIC-III\n\n_Last updated: 2026-02-17 02:54:44_\n\nThis notebook trains the HALO (Hierarchical Autoregressive Language mOdel) on your MIMIC-III data and generates synthetic patients.\n\n## What You'll Need\n\n1. **MIMIC-III Access**: Download these files from PhysioNet:\n - `ADMISSIONS.csv`\n - `DIAGNOSES_ICD.csv`\n\n2. **Google Colab**: Free tier works, but GPU recommended (Runtime \u2192 Change runtime type \u2192 GPU)\n\n3. **Time**:\n - Demo (5 epochs, 1K samples): ~20-30 min on GPU\n - Production (80 epochs, 10K samples): ~6-10 hours on GPU\n\n## How It Works\n\n1. **Setup**: Install PyHealth and mount Google Drive\n2. **Upload Data**: Upload your MIMIC-III CSV files\n3. **Configure**: Set hyperparameters (epochs, batch size, etc.)\n4. **Train**: Train HALO model (checkpoints saved to Drive)\n5. **Generate**: Create synthetic patients using trained model\n6. **Download**: Get CSV file with synthetic data\n\n## Important Notes\n\n\u26a0\ufe0f **Colab Timeout**: Free Colab sessions timeout after 12 hours. For production training (80 epochs), consider:\n- Colab Pro for longer sessions\n- Running on your own GPU cluster using `examples/slurm/train_halo_mimic3.slurm`\n\n\ud83d\udcca **Demo vs Production**:\n- Demo defaults (5 epochs, 1K samples) let you test the pipeline quickly\n- Production settings (80 epochs, 10K samples) match the published HALO results\n\n## References\n\n- [HALO Paper](https://arxiv.org/abs/2406.16061)\n- [PyHealth Documentation](https://pyhealth.readthedocs.io/)\n- [MIMIC-III Access](https://physionet.org/content/mimiciii/)"
"source": "# HALO Synthetic Data Generation for MIMIC-III\n\n_Last updated: 2026-02-17 03:38:26_\n\nThis notebook trains the HALO (Hierarchical Autoregressive Language mOdel) on your MIMIC-III data and generates synthetic patients.\n\n## What You'll Need\n\n1. **MIMIC-III Access**: Download these files from PhysioNet:\n - `ADMISSIONS.csv`\n - `DIAGNOSES_ICD.csv`\n\n2. **Google Colab**: Free tier works, but GPU recommended (Runtime \u2192 Change runtime type \u2192 GPU)\n\n3. **Time**:\n - Demo (5 epochs, 1K samples): ~20-30 min on GPU\n - Production (80 epochs, 10K samples): ~6-10 hours on GPU\n\n## How It Works\n\n1. **Setup**: Install PyHealth and mount Google Drive\n2. **Upload Data**: Upload your MIMIC-III CSV files\n3. **Configure**: Set hyperparameters (epochs, batch size, etc.)\n4. **Train**: Train HALO model (checkpoints saved to Drive)\n5. **Generate**: Create synthetic patients using trained model\n6. **Download**: Get CSV file with synthetic data\n\n## Important Notes\n\n\u26a0\ufe0f **Colab Timeout**: Free Colab sessions timeout after 12 hours. For production training (80 epochs), consider:\n- Colab Pro for longer sessions\n- Running on your own GPU cluster using `examples/slurm/train_halo_mimic3.slurm`\n\n\ud83d\udcca **Demo vs Production**:\n- Demo defaults (5 epochs, 1K samples) let you test the pipeline quickly\n- Production settings (80 epochs, 10K samples) match the published HALO results\n\n## References\n\n- [HALO Paper](https://arxiv.org/abs/2406.16061)\n- [PyHealth Documentation](https://pyhealth.readthedocs.io/)\n- [MIMIC-III Access](https://physionet.org/content/mimiciii/)"
jalengg (Author) commented:

@shiitavie This is an example of how to commit changes to the notebook

Make sure you (manually) change the Last updated timestamp
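
If editing the header by hand becomes error-prone, the same change could be scripted before committing. The sketch below is an optional alternative, not part of the PR; it assumes the header keeps the `_Last updated: YYYY-MM-DD HH:MM:SS_` form shown in the diff, and `refresh_timestamp` is a hypothetical name.

```python
import re
from datetime import datetime

# Matches the italic "_Last updated: ..._" line in the notebook's first markdown cell.
_STAMP = re.compile(r"_Last updated: [^_\n]*_")


def refresh_timestamp(markdown_source, now=None):
    """Return markdown_source with its first 'Last updated' stamp set to `now`."""
    now = now or datetime.now()
    stamp = f"_Last updated: {now:%Y-%m-%d %H:%M:%S}_"
    return _STAMP.sub(stamp, markdown_source, count=1)
```

The function works on the cell's `source` string only; loading and rewriting the `.ipynb` JSON around it is left to the caller.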

"cell_type": "markdown",
"metadata": {},
"source": "# HALO Synthetic Data Generation for MIMIC-III\n\n_Last updated: 2026-02-17 03:38:26_\n\nThis notebook trains the HALO (Hierarchical Autoregressive Language mOdel) on your MIMIC-III data and generates synthetic patients.\n\n## What You'll Need\n\n1. **MIMIC-III Access**: Download these files from PhysioNet:\n - `ADMISSIONS.csv`\n - `DIAGNOSES_ICD.csv`\n\n2. **Google Colab**: Free tier works, but GPU recommended (Runtime \u2192 Change runtime type \u2192 GPU)\n\n3. **Time**:\n - Demo (5 epochs, 1K samples): ~20-30 min on GPU\n - Production (80 epochs, 10K samples): ~6-10 hours on GPU\n\n## How It Works\n\n1. **Setup**: Install PyHealth and mount Google Drive\n2. **Upload Data**: Upload your MIMIC-III CSV files\n3. **Configure**: Set hyperparameters (epochs, batch size, etc.)\n4. **Train**: Train HALO model (checkpoints saved to Drive)\n5. **Generate**: Create synthetic patients using trained model\n6. **Download**: Get CSV file with synthetic data\n\n## Important Notes\n\n\u26a0\ufe0f **Colab Timeout**: Free Colab sessions timeout after 12 hours. For production training (80 epochs), consider:\n- Colab Pro for longer sessions\n- Running on your own GPU cluster using `examples/slurm/train_halo_mimic3.slurm`\n\n\ud83d\udcca **Demo vs Production**:\n- Demo defaults (5 epochs, 1K samples) let you test the pipeline quickly\n- Production settings (80 epochs, 10K samples) match the published HALO results\n\n## References\n\n- [HALO Paper](https://arxiv.org/abs/2406.16061)\n- [PyHealth Documentation](https://pyhealth.readthedocs.io/)\n- [MIMIC-III Access](https://physionet.org/content/mimiciii/)"
"source": "# HALO Synthetic Data Generation for MIMIC-III\n\n_Last updated: 2026-02-17 03:38:26_\n\nThis notebook trains the HALO (Hierarchical Autoregressive Language mOdel) on your MIMIC-III data and generates synthetic patients.\n\n## What You'll Need\n\n1. **MIMIC-III Access**: Download these files from PhysioNet:\n - `ADMISSIONS.csv`\n - `DIAGNOSES_ICD.csv`\n\n2. **Google Colab**: Free tier works, but GPU recommended (Runtime → Change runtime type → GPU)\n\n3. **Time**:\n - Demo (5 epochs, 1K samples): ~20-30 min on GPU\n - Production (80 epochs, 10K samples): ~6-10 hours on GPU\n\n## How It Works\n\n1. **Setup**: Install PyHealth and mount Google Drive\n2. **Upload Data**: Upload your MIMIC-III CSV files\n3. **Configure**: Set hyperparameters (epochs, batch size, etc.)\n4. **Train**: Train HALO model (checkpoints saved to Drive)\n5. **Generate**: Create synthetic patients using trained model\n6. **Download**: Get CSV file with synthetic data\n\n## Important Notes\n\n⚠️ **Colab Timeout**: Free Colab sessions timeout after 12 hours. For production training (80 epochs), consider:\n- Colab Pro for longer sessions\n- Running on your own GPU cluster using `examples/slurm/train_halo_mimic3.slurm`\n\n📊 **Demo vs Production**:\n- Demo defaults (5 epochs, 1K samples) let you test the pipeline quickly\n- Production settings (80 epochs, 10K samples) match the published HALO results\n\n## References\n\n- [HALO Paper](https://arxiv.org/abs/2406.16061)\n- [PyHealth Documentation](https://pyhealth.readthedocs.io/)\n- [MIMIC-III Access](https://physionet.org/content/mimiciii/)"
jalengg (Author) commented:

Suggested change
"source": "# HALO Synthetic Data Generation for MIMIC-III\n\n_Last updated: 2026-02-17 03:38:26_\n\nThis notebook trains the HALO (Hierarchical Autoregressive Language mOdel) on your MIMIC-III data and generates synthetic patients.\n\n## What You'll Need\n\n1. **MIMIC-III Access**: Download these files from PhysioNet:\n - `ADMISSIONS.csv`\n - `DIAGNOSES_ICD.csv`\n\n2. **Google Colab**: Free tier works, but GPU recommended (Runtime → Change runtime type → GPU)\n\n3. **Time**:\n - Demo (5 epochs, 1K samples): ~20-30 min on GPU\n - Production (80 epochs, 10K samples): ~6-10 hours on GPU\n\n## How It Works\n\n1. **Setup**: Install PyHealth and mount Google Drive\n2. **Upload Data**: Upload your MIMIC-III CSV files\n3. **Configure**: Set hyperparameters (epochs, batch size, etc.)\n4. **Train**: Train HALO model (checkpoints saved to Drive)\n5. **Generate**: Create synthetic patients using trained model\n6. **Download**: Get CSV file with synthetic data\n\n## Important Notes\n\n⚠️ **Colab Timeout**: Free Colab sessions timeout after 12 hours. For production training (80 epochs), consider:\n- Colab Pro for longer sessions\n- Running on your own GPU cluster using `examples/slurm/train_halo_mimic3.slurm`\n\n📊 **Demo vs Production**:\n- Demo defaults (5 epochs, 1K samples) let you test the pipeline quickly\n- Production settings (80 epochs, 10K samples) match the published HALO results\n\n## References\n\n- [HALO Paper](https://arxiv.org/abs/2406.16061)\n- [PyHealth Documentation](https://pyhealth.readthedocs.io/)\n- [MIMIC-III Access](https://physionet.org/content/mimiciii/)"
"source": "# HALO Synthetic Data Generation for MIMIC-III\n\n_Last updated: 2026-02-18 02:30:00_\n\nThis notebook trains the HALO (Hierarchical Autoregressive Language mOdel) on your MIMIC-III data and generates synthetic patients.\n\n## What You'll Need\n\n1. **MIMIC-III Access**: Download these files from PhysioNet:\n - `ADMISSIONS.csv`\n - `DIAGNOSES_ICD.csv`\n\n2. **Google Colab**: Free tier works, but GPU recommended (Runtime → Change runtime type → GPU)\n\n3. **Time**:\n - Demo (5 epochs, 1K samples): ~20-30 min on GPU\n - Production (80 epochs, 10K samples): ~6-10 hours on GPU\n\n## How It Works\n\n1. **Setup**: Install PyHealth and mount Google Drive\n2. **Upload Data**: Upload your MIMIC-III CSV files\n3. **Configure**: Set hyperparameters (epochs, batch size, etc.)\n4. **Train**: Train HALO model (checkpoints saved to Drive)\n5. **Generate**: Create synthetic patients using trained model\n6. **Download**: Get CSV file with synthetic data\n\n## Important Notes\n\n⚠️ **Colab Timeout**: Free Colab sessions timeout after 12 hours. For production training (80 epochs), consider:\n- Colab Pro for longer sessions\n- Running on your own GPU cluster using `examples/slurm/train_halo_mimic3.slurm`\n\n📊 **Demo vs Production**:\n- Demo defaults (5 epochs, 1K samples) let you test the pipeline quickly\n- Production settings (80 epochs, 10K samples) match the published HALO results\n\n## References\n\n- [HALO Paper](https://arxiv.org/abs/2406.16061)\n- [PyHealth Documentation](https://pyhealth.readthedocs.io/)\n- [MIMIC-III Access](https://physionet.org/content/mimiciii/)"

@jalengg jalengg changed the title Halo pr 528 HALO Implementation into PyHealth + Colab Notebook Feb 18, 2026

3 participants