HALO Implementation into PyHealth + Colab Notebook by jalengg · Pull Request #812 · sunlabuiuc/PyHealth

jalengg · 2026-02-04T02:55:28Z

No description provided.

- Add HALO (Healthcare generative model using transformers) implementation
- Include example training script with configurable parameters
- Include example generation script for synthetic patient data
- Add canonical SLURM scripts with optimal parameters (80 epochs, batch_size 48, lr 0.0001)
- Register HALO in generators module
- Update HALO_MIMIC3Dataset with latest preprocessing
- Update README with HALO documentation

Remove README.rst changes that only documented CorGAN, not HALO. This PR should focus solely on HALO implementation.

…ls to HALO notebook

Complete Tasks 3-7:
- Configuration panel with demo defaults
- Data upload with validation
- Training logic with checkpoint management
- Generation with CSV conversion
- Results display with quality checks and download

Notebook now has 24 cells with complete end-to-end workflow.

- Replace `!pip install` with subprocess.run() for error checking
- Show clear error message if installation fails
- Raise RuntimeError to stop notebook execution on failure

Fixes #1
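The error-checking pattern described in this commit can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual cell: `run_checked` is a hypothetical helper name, and the package spec passed to pip is left as a placeholder.

```python
import subprocess
import sys


def run_checked(cmd: list[str]) -> str:
    """Run a command and raise RuntimeError (with stderr) if it exits non-zero."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"Command failed (exit code {result.returncode}): {' '.join(cmd)}\n"
            f"{result.stderr.strip()}"
        )
    return result.stdout


# Hypothetical usage in an installation cell; PACKAGE_SPEC is a placeholder,
# not the notebook's real install target.
# PACKAGE_SPEC = "..."
# run_checked([sys.executable, "-m", "pip", "install", "--no-cache-dir", PACKAGE_SPEC])
```

Unlike a bare `!pip install` line, an uncaught `RuntimeError` halts "Run all" at the failing cell, so later cells do not run against a broken install.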

jalengg · 2026-02-16T09:56:16Z

examples/generate_synthetic_mimic3_halo.py

+
+import os
+import sys
+sys.path.insert(0, '/u/jalenj4/PyHealth')


- Remove PATIENTS.csv and patient_ids.txt (not used by HALO_MIMIC3Dataset)
- Handle Colab file renaming (ADMISSIONS (1).csv -> ADMISSIONS.csv)
- Allow uploading files one at a time with progress tracking
- Check Google Drive for existing files before requesting upload
- Add FORK variable to installation cell for easier testing

Fixes #4, #5, #6
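One way to handle the re-upload renaming mentioned above is to strip the ` (N)` suffix before the file extension. The sketch below is an assumption about how this could be done, not the notebook's code; `canonical_upload_name` is a hypothetical helper.

```python
import re

# Matches a " (N)" suffix that sits directly before the final extension,
# e.g. the " (1)" in "ADMISSIONS (1).csv".
_DUPLICATE_SUFFIX = re.compile(r" \(\d+\)(?=\.[^.]+$)")


def canonical_upload_name(filename: str) -> str:
    """Return the filename with any re-upload suffix removed."""
    return _DUPLICATE_SUFFIX.sub("", filename)


print(canonical_upload_name("ADMISSIONS (1).csv"))  # ADMISSIONS.csv
```

Names without the suffix pass through unchanged, so the helper can be applied to every uploaded file unconditionally.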

Fixes #7

Ensures Colab users always get the latest version from GitHub without using cached packages. Critical for picking up recent fixes like the halo_resources __init__.py. Fixes #18

Use os.path.join() instead of string concatenation to properly handle directory paths with or without trailing slashes. Fixes #19

Fixes #21

The YAML config files in pyhealth/datasets/configs/ were not being included when the package was installed via pip. This caused FileNotFoundError for multiple datasets including HALO, MIMIC3, MIMIC4, EHRShot, COVID-19 CXR, and Medical Transcriptions. Added MANIFEST.in to specify which non-Python files should be included in the package distribution.

Fixes #21

MANIFEST.in only affects sdist source distributions. When installing via `pip install git+https://...` (as in Colab), pip relies on package_data in setup.py to include non-Python files. Added explicit package_data to ensure YAML configs in pyhealth/datasets/configs/ are included in all install paths. Removed MANIFEST.in as it provided no benefit for pip-from-git installs.
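A `package_data` declaration of the kind described here might look roughly like the fragment below. The package key and glob pattern are assumptions inferred from the `pyhealth/datasets/configs/` path in the commit message, not copied from the PR's `setup.py`.

```python
# Hypothetical fragment of setup.py. In a real setup.py these keyword
# arguments would be passed on as setuptools.setup(**SETUP_KWARGS).
SETUP_KWARGS = dict(
    name="pyhealth",
    # Patterns are resolved relative to the named package's directory,
    # so this is meant to pick up pyhealth/datasets/configs/*.yaml.
    package_data={"pyhealth": ["datasets/configs/*.yaml"]},
)
```

An equivalent spelling keys the mapping on the subpackage instead, for example `{"pyhealth.datasets": ["configs/*.yaml"]}`.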

Timestamp reflects when notebook was last modified so users can verify they are running the correct version. Reverts dynamic install-time timestamp in favor of this static header approach.

When pkl_data_dir has no trailing slash (e.g. "/path/to/pkl_data"), raw f-string concatenation produced invalid paths like "/path/to/pkl_datacodeToIndex.pkl" instead of "/path/to/pkl_data/codeToIndex.pkl". Replace all pickle save paths with os.path.join(). Also add os.makedirs() so the output directory is created if missing.
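The fix described above amounts to the following pattern. This is a minimal sketch, with `save_pickle` as a hypothetical helper name rather than a function from HALO_MIMIC3Dataset.

```python
import os
import pickle


def save_pickle(obj, pkl_data_dir: str, filename: str) -> str:
    """Pickle obj to pkl_data_dir/filename, creating the directory if needed."""
    os.makedirs(pkl_data_dir, exist_ok=True)
    # os.path.join inserts a separator only when one is missing, so the
    # result is the same with or without a trailing slash on pkl_data_dir.
    path = os.path.join(pkl_data_dir, filename)
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return path


# The bug: plain concatenation silently drops the separator.
pkl_data_dir = "/path/to/pkl_data"
broken = f"{pkl_data_dir}codeToIndex.pkl"  # "/path/to/pkl_datacodeToIndex.pkl"
fixed = os.path.join(pkl_data_dir, "codeToIndex.pkl")
```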

jalengg · 2026-02-18T08:22:25Z

examples/halo_mimic3_colab.ipynb

   "cell_type": "markdown",
   "metadata": {},
-   "source": "# HALO Synthetic Data Generation for MIMIC-III\n\n_Last updated: 2026-02-17 02:54:44_\n\nThis notebook trains the HALO (Hierarchical Autoregressive Language mOdel) on your MIMIC-III data and generates synthetic patients.\n\n## What You'll Need\n\n1. **MIMIC-III Access**: Download these files from PhysioNet:\n   - `ADMISSIONS.csv`\n   - `DIAGNOSES_ICD.csv`\n\n2. **Google Colab**: Free tier works, but GPU recommended (Runtime \u2192 Change runtime type \u2192 GPU)\n\n3. **Time**:\n   - Demo (5 epochs, 1K samples): ~20-30 min on GPU\n   - Production (80 epochs, 10K samples): ~6-10 hours on GPU\n\n## How It Works\n\n1. **Setup**: Install PyHealth and mount Google Drive\n2. **Upload Data**: Upload your MIMIC-III CSV files\n3. **Configure**: Set hyperparameters (epochs, batch size, etc.)\n4. **Train**: Train HALO model (checkpoints saved to Drive)\n5. **Generate**: Create synthetic patients using trained model\n6. **Download**: Get CSV file with synthetic data\n\n## Important Notes\n\n\u26a0\ufe0f **Colab Timeout**: Free Colab sessions timeout after 12 hours. For production training (80 epochs), consider:\n- Colab Pro for longer sessions\n- Running on your own GPU cluster using `examples/slurm/train_halo_mimic3.slurm`\n\n\ud83d\udcca **Demo vs Production**:\n- Demo defaults (5 epochs, 1K samples) let you test the pipeline quickly\n- Production settings (80 epochs, 10K samples) match the published HALO results\n\n## References\n\n- [HALO Paper](https://arxiv.org/abs/2406.16061)\n- [PyHealth Documentation](https://pyhealth.readthedocs.io/)\n- [MIMIC-III Access](https://physionet.org/content/mimiciii/)"
+   "source": "# HALO Synthetic Data Generation for MIMIC-III\n\n_Last updated: 2026-02-17 03:38:26_\n\nThis notebook trains the HALO (Hierarchical Autoregressive Language mOdel) on your MIMIC-III data and generates synthetic patients.\n\n## What You'll Need\n\n1. **MIMIC-III Access**: Download these files from PhysioNet:\n   - `ADMISSIONS.csv`\n   - `DIAGNOSES_ICD.csv`\n\n2. **Google Colab**: Free tier works, but GPU recommended (Runtime \u2192 Change runtime type \u2192 GPU)\n\n3. **Time**:\n   - Demo (5 epochs, 1K samples): ~20-30 min on GPU\n   - Production (80 epochs, 10K samples): ~6-10 hours on GPU\n\n## How It Works\n\n1. **Setup**: Install PyHealth and mount Google Drive\n2. **Upload Data**: Upload your MIMIC-III CSV files\n3. **Configure**: Set hyperparameters (epochs, batch size, etc.)\n4. **Train**: Train HALO model (checkpoints saved to Drive)\n5. **Generate**: Create synthetic patients using trained model\n6. **Download**: Get CSV file with synthetic data\n\n## Important Notes\n\n\u26a0\ufe0f **Colab Timeout**: Free Colab sessions timeout after 12 hours. For production training (80 epochs), consider:\n- Colab Pro for longer sessions\n- Running on your own GPU cluster using `examples/slurm/train_halo_mimic3.slurm`\n\n\ud83d\udcca **Demo vs Production**:\n- Demo defaults (5 epochs, 1K samples) let you test the pipeline quickly\n- Production settings (80 epochs, 10K samples) match the published HALO results\n\n## References\n\n- [HALO Paper](https://arxiv.org/abs/2406.16061)\n- [PyHealth Documentation](https://pyhealth.readthedocs.io/)\n- [MIMIC-III Access](https://physionet.org/content/mimiciii/)"


@shiitavie This is an example of how to commit changes to the notebook

Make sure you (manually) change the Last updated timestamp
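If editing the timestamp by hand proves error-prone, a small script could do it instead. The sketch below is only an illustration, not part of this PR: `stamp_notebook` and `stamp_file` are hypothetical helpers, and the regex assumes the `_Last updated: YYYY-MM-DD HH:MM:SS_` header format shown in the diff above.

```python
import json
import re
from datetime import datetime

# Matches the italic header line, e.g. "_Last updated: 2026-02-17 03:38:26_".
_STAMP = re.compile(r"_Last updated: [^_\n]*_")


def stamp_notebook(nb: dict, when: datetime) -> bool:
    """Rewrite the first '_Last updated: ..._' found in a markdown cell."""
    new = f"_Last updated: {when:%Y-%m-%d %H:%M:%S}_"
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue
        src = cell["source"]
        # nbformat allows "source" to be a string or a list of lines.
        text = src if isinstance(src, str) else "".join(src)
        updated, count = _STAMP.subn(new, text, count=1)
        if count:
            cell["source"] = (
                updated if isinstance(src, str) else updated.splitlines(keepends=True)
            )
            return True
    return False


def stamp_file(path: str) -> None:
    """Update the header timestamp of the notebook at path, in place."""
    with open(path, encoding="utf-8") as f:
        nb = json.load(f)
    if stamp_notebook(nb, datetime.now()):
        with open(path, "w", encoding="utf-8") as f:
            # ensure_ascii=False keeps characters such as the arrow literal
            # instead of rewriting them as \uXXXX escapes.
            json.dump(nb, f, indent=1, ensure_ascii=False)
            f.write("\n")
```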

jalengg · 2026-02-18T08:30:12Z

examples/halo_mimic3_colab.ipynb

   "cell_type": "markdown",
   "metadata": {},
-   "source": "# HALO Synthetic Data Generation for MIMIC-III\n\n_Last updated: 2026-02-17 03:38:26_\n\nThis notebook trains the HALO (Hierarchical Autoregressive Language mOdel) on your MIMIC-III data and generates synthetic patients.\n\n## What You'll Need\n\n1. **MIMIC-III Access**: Download these files from PhysioNet:\n   - `ADMISSIONS.csv`\n   - `DIAGNOSES_ICD.csv`\n\n2. **Google Colab**: Free tier works, but GPU recommended (Runtime \u2192 Change runtime type \u2192 GPU)\n\n3. **Time**:\n   - Demo (5 epochs, 1K samples): ~20-30 min on GPU\n   - Production (80 epochs, 10K samples): ~6-10 hours on GPU\n\n## How It Works\n\n1. **Setup**: Install PyHealth and mount Google Drive\n2. **Upload Data**: Upload your MIMIC-III CSV files\n3. **Configure**: Set hyperparameters (epochs, batch size, etc.)\n4. **Train**: Train HALO model (checkpoints saved to Drive)\n5. **Generate**: Create synthetic patients using trained model\n6. **Download**: Get CSV file with synthetic data\n\n## Important Notes\n\n\u26a0\ufe0f **Colab Timeout**: Free Colab sessions timeout after 12 hours. For production training (80 epochs), consider:\n- Colab Pro for longer sessions\n- Running on your own GPU cluster using `examples/slurm/train_halo_mimic3.slurm`\n\n\ud83d\udcca **Demo vs Production**:\n- Demo defaults (5 epochs, 1K samples) let you test the pipeline quickly\n- Production settings (80 epochs, 10K samples) match the published HALO results\n\n## References\n\n- [HALO Paper](https://arxiv.org/abs/2406.16061)\n- [PyHealth Documentation](https://pyhealth.readthedocs.io/)\n- [MIMIC-III Access](https://physionet.org/content/mimiciii/)"
+   "source": "# HALO Synthetic Data Generation for MIMIC-III\n\n_Last updated: 2026-02-17 03:38:26_\n\nThis notebook trains the HALO (Hierarchical Autoregressive Language mOdel) on your MIMIC-III data and generates synthetic patients.\n\n## What You'll Need\n\n1. **MIMIC-III Access**: Download these files from PhysioNet:\n   - `ADMISSIONS.csv`\n   - `DIAGNOSES_ICD.csv`\n\n2. **Google Colab**: Free tier works, but GPU recommended (Runtime → Change runtime type → GPU)\n\n3. **Time**:\n   - Demo (5 epochs, 1K samples): ~20-30 min on GPU\n   - Production (80 epochs, 10K samples): ~6-10 hours on GPU\n\n## How It Works\n\n1. **Setup**: Install PyHealth and mount Google Drive\n2. **Upload Data**: Upload your MIMIC-III CSV files\n3. **Configure**: Set hyperparameters (epochs, batch size, etc.)\n4. **Train**: Train HALO model (checkpoints saved to Drive)\n5. **Generate**: Create synthetic patients using trained model\n6. **Download**: Get CSV file with synthetic data\n\n## Important Notes\n\n⚠️ **Colab Timeout**: Free Colab sessions timeout after 12 hours. For production training (80 epochs), consider:\n- Colab Pro for longer sessions\n- Running on your own GPU cluster using `examples/slurm/train_halo_mimic3.slurm`\n\n📊 **Demo vs Production**:\n- Demo defaults (5 epochs, 1K samples) let you test the pipeline quickly\n- Production settings (80 epochs, 10K samples) match the published HALO results\n\n## References\n\n- [HALO Paper](https://arxiv.org/abs/2406.16061)\n- [PyHealth Documentation](https://pyhealth.readthedocs.io/)\n- [MIMIC-III Access](https://physionet.org/content/mimiciii/)"


Suggested change

"source": "# HALO Synthetic Data Generation for MIMIC-III\n\n_Last updated: 2026-02-17 03:38:26_\n\nThis notebook trains the HALO (Hierarchical Autoregressive Language mOdel) on your MIMIC-III data and generates synthetic patients.\n\n## What You'll Need\n\n1. **MIMIC-III Access**: Download these files from PhysioNet:\n - `ADMISSIONS.csv`\n - `DIAGNOSES_ICD.csv`\n\n2. **Google Colab**: Free tier works, but GPU recommended (Runtime → Change runtime type → GPU)\n\n3. **Time**:\n - Demo (5 epochs, 1K samples): ~20-30 min on GPU\n - Production (80 epochs, 10K samples): ~6-10 hours on GPU\n\n## How It Works\n\n1. **Setup**: Install PyHealth and mount Google Drive\n2. **Upload Data**: Upload your MIMIC-III CSV files\n3. **Configure**: Set hyperparameters (epochs, batch size, etc.)\n4. **Train**: Train HALO model (checkpoints saved to Drive)\n5. **Generate**: Create synthetic patients using trained model\n6. **Download**: Get CSV file with synthetic data\n\n## Important Notes\n\n⚠️ **Colab Timeout**: Free Colab sessions timeout after 12 hours. For production training (80 epochs), consider:\n- Colab Pro for longer sessions\n- Running on your own GPU cluster using `examples/slurm/train_halo_mimic3.slurm`\n\n📊 **Demo vs Production**:\n- Demo defaults (5 epochs, 1K samples) let you test the pipeline quickly\n- Production settings (80 epochs, 10K samples) match the published HALO results\n\n## References\n\n- [HALO Paper](https://arxiv.org/abs/2406.16061)\n- [PyHealth Documentation](https://pyhealth.readthedocs.io/)\n- [MIMIC-III Access](https://physionet.org/content/mimiciii/)"

"source": "# HALO Synthetic Data Generation for MIMIC-III\n\n_Last updated: 2026-02-18 02:30:00_\n\nThis notebook trains the HALO (Hierarchical Autoregressive Language mOdel) on your MIMIC-III data and generates synthetic patients.\n\n## What You'll Need\n\n1. **MIMIC-III Access**: Download these files from PhysioNet:\n - `ADMISSIONS.csv`\n - `DIAGNOSES_ICD.csv`\n\n2. **Google Colab**: Free tier works, but GPU recommended (Runtime → Change runtime type → GPU)\n\n3. **Time**:\n - Demo (5 epochs, 1K samples): ~20-30 min on GPU\n - Production (80 epochs, 10K samples): ~6-10 hours on GPU\n\n## How It Works\n\n1. **Setup**: Install PyHealth and mount Google Drive\n2. **Upload Data**: Upload your MIMIC-III CSV files\n3. **Configure**: Set hyperparameters (epochs, batch size, etc.)\n4. **Train**: Train HALO model (checkpoints saved to Drive)\n5. **Generate**: Create synthetic patients using trained model\n6. **Download**: Get CSV file with synthetic data\n\n## Important Notes\n\n⚠️ **Colab Timeout**: Free Colab sessions timeout after 12 hours. For production training (80 epochs), consider:\n- Colab Pro for longer sessions\n- Running on your own GPU cluster using `examples/slurm/train_halo_mimic3.slurm`\n\n📊 **Demo vs Production**:\n- Demo defaults (5 epochs, 1K samples) let you test the pipeline quickly\n- Production settings (80 epochs, 10K samples) match the published HALO results\n\n## References\n\n- [HALO Paper](https://arxiv.org/abs/2406.16061)\n- [PyHealth Documentation](https://pyhealth.readthedocs.io/)\n- [MIMIC-III Access](https://physionet.org/content/mimiciii/)"

chufangao and others added 11 commits June 15, 2025 13:04

init generators commit

d1f97af

base

ee8c52c

Stab at implementation

00f10c2

Misc. changes for testing

b666f82

Remove testing logs

ec4f23d

Clean up things a bit

4ce8e21

Clean up hardcoded file path

b1584fd

Remove testing files from PR

d374603

Init model properly

4f456f9

Update comments

56380f6

jalengg marked this pull request as ready for review February 14, 2026 23:43

jalengg marked this pull request as draft February 14, 2026 23:43

Remove non-HALO README changes

d2b8da3

jalengg force-pushed the halo-pr-528 branch from a4ff537 to d2b8da3 Compare February 16, 2026 05:03

jalengg added 5 commits February 16, 2026 01:10

Create HALO Colab notebook structure with headers

97050b8

Add setup and installation cells to HALO notebook

58ef738

Add README documentation for HALO Colab notebook

b1458fe

Fix installation cell to detect pip failures

702d65c

jalengg commented Feb 16, 2026

examples/generate_synthetic_mimic3_halo.py

import os

import sys

sys.path.insert(0, '/u/jalenj4/PyHealth')

remove

jalengg added 3 commits February 16, 2026 04:12

Remove pandas<2 constraint for Python 3.12 compatibility

261f819

Add missing __init__.py to halo_resources module

b8b4c96

jalengg mentioned this pull request Feb 16, 2026

[halo-pr-prep] Overview: HALO PyHealth Integration Tasks jalengg/PyHealth#17

Open

7 tasks

jalengg added 5 commits February 16, 2026 05:15

Add --no-cache-dir to pip install for latest code

564cf0a

Fix path concatenation bug in HALO_MIMIC3Dataset

8002123

Add install timestamp to Colab notebook success message

2418788

jalengg and others added 6 commits February 17, 2026 02:56

Add last-updated timestamp to Colab notebook header

f1ceb35

Use human-readable timestamp format in notebook header

663dbb8

added trailing slash

200e693

added trailing slash

76de88e

format string error

5ec2c42

jalengg commented Feb 18, 2026

jalengg changed the title ~~Halo pr 528~~ HALO Implementation into PyHealth + Colab Notebook Feb 18, 2026

shiitavie and others added 4 commits February 18, 2026 17:04

remove assertion (issue #23)

bb5de81

fix: complete merge - add missing processor files from upstream/master

4040422

chore: merge upstream/master, resolve processor/model conflicts

0c9e973

feat: add halo_generation task function (HaloGenerationMIMIC3/4)

a864781

jalengg wants to merge 35 commits into sunlabuiuc:master from jalengg:halo-pr-528


3 participants
