ExpGuard: LLM Content Moderation in Specialized Domains

This repository is the official implementation for our ICLR 2026 paper: ExpGuard: LLM Content Moderation in Specialized Domains.

🔧 Requirements

To install requirements:

conda create -n expguard python=3.12.9
conda activate expguard
pip install -r requirements.txt
pip install flash-attn==2.7.3 --no-build-isolation

⌛ Hardware Requirements

Training

Recommended Setup (Used in Paper): 4 × NVIDIA H200 144GB GPUs
Minimum Requirements (tested): 2 × NVIDIA H200 144GB GPUs
The training process requires significant GPU memory due to the large model size and long sequence length (4096).

Evaluation

A single NVIDIA A6000 or A100 GPU is sufficient for evaluation
Evaluation can be run on smaller GPUs with appropriate batch size adjustments

🤗 Data

Get our ExpGuardMix dataset from Hugging Face Hub.

WildChat, LMSYS-Chat-1M, and Suicide Detection subsets in Aegis 2.0 have been removed from the distributed version of the dataset. Users can access the subsets used in the paper through the Google Drive link. Put the subsets in the data/expguardmix folder.

📁 Project Structure

ExpGuard/
├── data/
│   └── expguardmix/
│       ├── lmsys.jsonl
│       ├── suicide.jsonl
│       └── wildchat.jsonl
├── scripts/
│   ├── train.sh
│   └── safety_eval.sh
├── safety-eval/
│   └── ... (other evaluation scripts)
├── train.py
├── ... (other training scripts)
└── README.md

🔥 Training

To train the model in the paper:

Configure your training parameters in scripts/train.sh. Key parameters include:
- dataset_name: Name of the dataset (default: expguardmix).
- model_type: Type of the model (default: qwen2.5-7b).
- model_name_or_path: Pretrained model name or path (default: Qwen/Qwen2.5-7B).
- per_device_train_batch_size: Batch size per device (default: 4).
- gradient_accumulation_steps: Gradient accumulation steps (default: 2).
- max_seq_len: Maximum sequence length (default: 4096).
- learning_rate: Learning rate (default: 5e-6).
- epochs: Number of training epochs (default: 3).
- attn_implementation: Attention implementation (default: flash_attention_2). Set to sdpa or eager if flash attention is not available.
Run the training script:

bash scripts/train.sh

📏 Evaluation

To evaluate model safety:

Configure evaluation parameters in scripts/safety_eval.sh.
Set the model path (model_path) to the local path where you have downloaded ExpGuard's model weights. If you are evaluating a model available on HuggingFace Hub, you can leave this empty, and the script will download it automatically.
Set task_type to prompt or response for prompt and response evaluation, respectively.
Set model_name to the model you wish to evaluate. The default is our model ExpGuard. The baseline model names can be found in safety-eval/src/classifier_models/loader.py (e.g., OpenAIModeration, PerspectiveAPI, LlamaGuard3, WildGuard, etc.). You need API keys to run API-based guardrails. These should be set as environment variables (OPENAI_API_KEY, PERSPECTIVE_API_KEY, AZURE_CONTENT_SAFETY_KEY, AZURE_CONTENT_SAFETY_ENDPOINT) in scripts/safety_eval.sh. All API tools are free, except Azure, which is commercial but offers some free credit for new accounts.
Run the evaluation script:

bash scripts/safety_eval.sh

The results will be saved in the ./classfication_results directory.

🔢 Results

The following are the reported scores of ExpGuard:

Model	Public Prompt Harm Avg. F1	Public Response Harm Avg. F1	ExpGuardTest Prompt Harm Total F1	ExpGuardTest Response Harm Total F1
ExpGuard	85.7	78.5	93.3	92.7

To reproduce the scores in the paper, we use early stopping at the end of epoch 2. Please refer to the Appendix for more hyperparameter details.

Note: The reported values may not be perfectly reproducible and might vary slightly due to factors such as the vLLM inference library, CUDA versions, specific GPU devices used, and other stochastic elements in the evaluation process.

Acknowledgements

The evaluation code is adapted from safety-eval. Props to the Ai2 team 👍

Citation

@inproceedings{
  choi2026expguard,
  title={ExpGuard: {LLM} Content Moderation in Specialized Domains},
  author={Choi, Minseok and Kim, Dongjin and Yang, Seungbin and Kim, Subin and Kwak, Youngjun and Oh, Juyoung and Choo, Jaegul and Son, Jungmin},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=t5cYJlV6aJ}
}

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
safety-eval		safety-eval
scripts		scripts
templates		templates
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
dataset.py		dataset.py
model.py		model.py
requirements.txt		requirements.txt
train.py		train.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ExpGuard: LLM Content Moderation in Specialized Domains

🔧 Requirements

⌛ Hardware Requirements

Training

Evaluation

🤗 Data

📁 Project Structure

🔥 Training

📏 Evaluation

🔢 Results

Acknowledgements

Citation

About

Uh oh!

Releases

Packages

Languages

License

brightjade/ExpGuard

Folders and files

Latest commit

History

Repository files navigation

ExpGuard: LLM Content Moderation in Specialized Domains

🔧 Requirements

⌛ Hardware Requirements

Training

Evaluation

🤗 Data

📁 Project Structure

🔥 Training

📏 Evaluation

🔢 Results

Acknowledgements

Citation

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages