Directional Textual Inversion for Personalized Text-to-Image Generation
Kunhee Kim1* · NaHyeon Park1* · Kibeom Hong2 · Hyunjung Shim1
1KAIST · 2Sookmyung Women's University
Textual Inversion (TI) is efficient but often suffers from embedding norm inflation: learned tokens drift to out-of-distribution magnitudes, which degrades prompt fidelity in pre-norm Transformers.
Directional Textual Inversion (DTI) addresses this by:
- Decoupling magnitude and direction: Fixing embedding magnitude to an in-distribution scale
- Spherical optimization: Optimizing only the direction on the unit hypersphere via Riemannian SGD
- vMF prior: Using a von Mises-Fisher prior for semantic coherence
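The three ingredients above can be illustrated with a small, dependency-free sketch. It is not the repository's implementation: the function names (`normalize`, `riemannian_sgd_step`, `embed`) are hypothetical, the retraction by renormalization is one common choice for the unit sphere, and using the initializer direction as the vMF mean `mu` is an assumption. The only facts relied on are that the vMF log-density is `kappa * mu·d` up to a constant and that a Riemannian gradient on the sphere is the Euclidean gradient projected onto the tangent space.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def riemannian_sgd_step(direction, task_grad, mu, kappa, lr):
    """One Riemannian SGD step on the unit sphere (illustrative sketch).

    direction: current unit vector d
    task_grad: Euclidean gradient of the task loss w.r.t. d
    mu:        unit mean direction of the vMF prior (assumed: initializer token)
    kappa:     vMF concentration, i.e. prior strength
    """
    # The vMF negative log-density is -kappa * mu.d + const, so its gradient is -kappa * mu.
    grad = [g - kappa * m for g, m in zip(task_grad, mu)]
    # Project onto the tangent space at d: g - (g.d) d
    radial = sum(g * d for g, d in zip(grad, direction))
    tangent = [g - radial * d for g, d in zip(grad, direction)]
    # Take the step, then retract back to the sphere by renormalizing.
    stepped = [d - lr * t for d, t in zip(direction, tangent)]
    return normalize(stepped)


def embed(direction, scale):
    """Recombine the fixed magnitude with the learned direction."""
    return [scale * d for d in direction]


if __name__ == "__main__":
    d = normalize([1.0, 0.0, 0.0])
    mu = normalize([0.0, 1.0, 0.0])
    for _ in range(10):
        # Zero task gradient: only the prior pulls d toward mu.
        d = riemannian_sgd_step(d, [0.0, 0.0, 0.0], mu, kappa=1.0, lr=0.1)
    print(embed(d, scale=2.5))
```

Because the magnitude only enters through `embed`, the norm of the learned token stays at `scale` throughout training in this sketch, whatever the gradients do to the direction.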
DTI achieves superior text fidelity and subject preservation on Stable Diffusion XL (SDXL), SANA, and Wan2.1-T2V-1.3B (image-as-1-frame-video setup).
Our implementation is built on top of the HuggingFace diffusers library and is fully compatible with existing Textual Inversion (TI) pipelines.
Requirements: Python 3.9+ · PyTorch 2.0+ · CUDA GPU
git clone https://github.com/kunheek/dti
cd dti
pip install -e .

Using [`uv`](https://docs.astral.sh/uv/) (recommended for faster setup):
uv venv --python 3.12
source .venv/bin/activate
pip install -e .

python exps/ours_sdxl.py -g 0

This runs DTI on all DreamBooth subjects. To train on a specific subject:
python exps/ours_sdxl.py -g 0 --instances dog

Full training command:
accelerate launch --mixed_precision=bf16 --num_processes=1 \
scripts/train_sdxl.py \
--pretrained_model_name_or_path "stabilityai/stable-diffusion-xl-base-1.0" \
--train_data_dir data/dreambooth/dog \
--output_dir output/dti-sdxl/dog \
--placeholder_token "<dog>" \
--initializer_token dog \
--resolution 768 \
--train_batch_size 4 \
--max_train_steps 400 \
--learning_rate 0.01 \
--token_scale mean \
--kappa 1e-4 \
--decompose_scale true

python exps/ours_sana.py -g 0 -m sana1.5_1.6b

Note: Our paper reports SANA with learning rate `0.02` and `1000` steps, but later experiments showed better performance with learning rate `0.01` and `500` steps. We recommend using the `0.01` + `500` combination for SANA.

python exps/ours_wan.py -g 2 -m wan2.1_t2v_1.3b

Wan uses `Wan-AI/Wan2.1-T2V-1.3B-Diffusers`. If you see `ValueError: Unrecognized model ...`, upgrade `transformers` and `diffusers`:
uv pip install --python .venv/bin/python --upgrade "git+https://github.com/huggingface/diffusers.git@main" "transformers>=4.48" "huggingface-hub>=0.34,<1.0"
python scripts/evaluate.py -e output/dti-sdxl

| Parameter | Description | Default |
|---|---|---|
| `-g` | GPU ID | 0 |
| `--instances` | Subject names | all |
| `--max_train_steps` | Training iterations | 400 |
| `--kappa` | vMF prior strength | 1e-4 |
| `--learning_rate` | Learning rate | 0.01 |
| `--token_scale` | Magnitude scale (`mean`, `max`, or float) | `mean` |
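
As a rough illustration of the `--token_scale` option, the sketch below resolves `mean`, `max`, or a float into a single magnitude. It assumes that `mean` and `max` refer to the mean and maximum L2 norm over the existing vocabulary embeddings, which is one plausible reading of "in-distribution scale" and is not confirmed here; `resolve_token_scale` is a hypothetical helper, not a function from this repository.

```python
import math


def resolve_token_scale(token_scale, vocab_embeddings):
    """Turn a token_scale setting into a fixed embedding magnitude.

    token_scale:      "mean", "max", or a number (or numeric string)
    vocab_embeddings: list of embedding vectors from the text encoder
    Assumption: "mean"/"max" are statistics of the vocabulary's L2 norms.
    """
    norms = [math.sqrt(sum(x * x for x in e)) for e in vocab_embeddings]
    if token_scale == "mean":
        return sum(norms) / len(norms)
    if token_scale == "max":
        return max(norms)
    # Anything else is treated as an explicit magnitude.
    return float(token_scale)


if __name__ == "__main__":
    toy_vocab = [[3.0, 4.0], [0.0, 1.0]]
    for setting in ("mean", "max", "0.5"):
        print(setting, resolve_token_scale(setting, toy_vocab))
```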
# Standard Textual Inversion
python exps/ti_sdxl.py -g 0
# DCO + DTI
python exps/ours_dco_sdxl.py -g 0

data/
├── dreambooth.json
└── dreambooth/
    └── subject_name/
        ├── 00.jpg
        ├── 01.jpg
        └── ...
JSON format:
{
  "subject_name": {
    "path": "data/dreambooth/subject_name",
    "class": "dog",
    "initialization": "dog"
  }
}

Download DreamBooth dataset:
python scripts/download_datasets.py

dti/
├── src/dti/ # Core implementation
├── scripts/ # Training & evaluation scripts
├── exps/ # Experiment launchers
└── data/ # Datasets and configs
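
For custom subjects, the `dreambooth.json` format described above can be checked before launching training. The sketch below is hypothetical and not part of the repository: `load_subjects` and `train_args` are illustrative names, and mapping the `initialization` field to `--initializer_token` and the subject name to a `<name>` placeholder token are assumptions based on the `dog` example in the full training command.

```python
import json

REQUIRED_KEYS = ("path", "class", "initialization")


def load_subjects(json_path):
    """Read the subject config and check that every entry has the expected keys."""
    with open(json_path, encoding="utf-8") as f:
        subjects = json.load(f)
    for name, entry in subjects.items():
        missing = [k for k in REQUIRED_KEYS if k not in entry]
        if missing:
            raise ValueError(f"{name}: missing keys {missing}")
    return subjects


def train_args(name, entry, output_root="output/dti-sdxl"):
    """Build per-subject arguments mirroring the full SDXL training command.

    Assumptions: the placeholder token is "<name>" and the "initialization"
    field is what gets passed as --initializer_token.
    """
    return [
        "--train_data_dir", entry["path"],
        "--output_dir", f"{output_root}/{name}",
        "--placeholder_token", f"<{name}>",
        "--initializer_token", entry["initialization"],
    ]


if __name__ == "__main__":
    for name, entry in load_subjects("data/dreambooth.json").items():
        print(name, " ".join(train_args(name, entry)))
```

Failing early on a missing key is cheaper than discovering it after the model weights have been loaded.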
@inproceedings{kim2026directional,
title={Directional Textual Inversion for Personalized Text-to-Image Generation},
author={Kim, Kunhee and Park, NaHyeon and Hong, Kibeom and Shim, Hyunjung},
booktitle={International Conference on Learning Representations},
year={2026}
}

This project is licensed under the MIT License. See LICENSE for details.
